Marks some things as done

2026-06-30 17:24:22 +00:00 · 2026-06-12 11:38:20 -07:00
parent b2151acfd5
commit 85cd9b657b
6 changed files with 0 additions and 0 deletions
@@ -0,0 +1,745 @@
+# LanceDB Schema Migration Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add a schema versioning and migration system to the LanceDB vector store so that structural column changes can be applied in-place without re-embedding documents, avoiding token costs for users on paid embedding APIs.
+
+**Architecture:** A `schema_version.json` file is written alongside the LanceDB data directory and tracks the current applied version. A `Migration` dataclass registry in `vector_store.py` holds ordered, typed migration steps; each migration is classified as `requires_reembed=True/False`. At index update time, structural-only migrations are applied in-place via LanceDB's `add_columns`/`alter_columns`/`drop_columns` APIs; if any pending migration requires re-embedding, the existing model-mismatch rebuild path is reused.
+
+**Tech Stack:** Python 3.11, lancedb 0.33, pyarrow, pytest, pytest-mock, factory-boy
+
+---
+
+## File Map
+
+| File                                          | Change                                                                                                                                |
+| --------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- |
+| `src/paperless_ai/vector_store.py`            | Add `CURRENT_SCHEMA_VERSION`, `Migration` dataclass, version file helpers, migration methods; modify `_ensure_table` and `drop_table` |
+| `src/paperless_ai/indexing.py`                | Call migration inside `update_llm_index`'s `write_store` block                                                                        |
+| `src/paperless_ai/tests/test_vector_store.py` | New `TestSchemaVersioning` and `TestMigrations` test classes                                                                          |
+| `src/paperless_ai/tests/test_ai_indexing.py`  | Two new integration tests for migration path                                                                                          |
+
+---
+
+## Task 1: Schema version file helpers
+
+**Files:**
+
+- Modify: `src/paperless_ai/vector_store.py`
+- Test: `src/paperless_ai/tests/test_vector_store.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+Add a new class at the bottom of `test_vector_store.py`:
+
+```python
+class TestSchemaVersioning:
+    @pytest.fixture
+    def uri(self, tmp_path: Path) -> str:
+        return str(tmp_path / "idx")
+
+    def test_version_file_written_on_table_creation(self, uri: str) -> None:
+        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.add([_node("1-0", "1", "text", 0.1)])
+
+        version_file = Path(uri) / "schema_version.json"
+        assert version_file.exists()
+        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION
+
+    def test_stored_schema_version_returns_current_when_file_missing(
+        self, uri: str
+    ) -> None:
+        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.add([_node("1-0", "1", "text", 0.1)])
+        (Path(uri) / "schema_version.json").unlink()
+
+        reopened = PaperlessLanceVectorStore(uri=uri)
+        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION
+
+    def test_stored_schema_version_persists_after_reopen(self, uri: str) -> None:
+        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+
+        PaperlessLanceVectorStore(uri=uri).add([_node("1-0", "1", "text", 0.1)])
+
+        reopened = PaperlessLanceVectorStore(uri=uri)
+        assert reopened.stored_schema_version() == CURRENT_SCHEMA_VERSION
+
+    def test_drop_table_removes_version_file(self, uri: str) -> None:
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.add([_node("1-0", "1", "text", 0.1)])
+        assert (Path(uri) / "schema_version.json").exists()
+
+        store.drop_table()
+        assert not (Path(uri) / "schema_version.json").exists()
+
+    def test_version_file_written_on_upsert_creation(self, uri: str) -> None:
+        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.upsert_document("1", [_node("1-0", "1", "text", 0.1)])
+
+        version_file = Path(uri) / "schema_version.json"
+        assert json.loads(version_file.read_text())["version"] == CURRENT_SCHEMA_VERSION
+```
+
+Add `import json` and `import pytest_mock` to the top of `test_vector_store.py`.
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
+```
+
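+Expected: all 5 tests fail. Four fail with `ImportError` or `AttributeError` because `CURRENT_SCHEMA_VERSION` and `stored_schema_version` don't exist yet; `test_drop_table_removes_version_file` fails with `AssertionError` because no version file is written yet.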
+
+- [ ] **Step 3: Implement the schema version helpers in `vector_store.py`**
+
+After the existing imports and before the `DEFAULT_TABLE_NAME` constant, add:
+
+```python
+import json
+from pathlib import Path
+```
+
+After `DEFAULT_TABLE_NAME = "documents"`, add:
+
+```python
+CURRENT_SCHEMA_VERSION: int = 1
+```
+
+No module-level helpers are needed after the `ANN_PQ_SUB_VECTORS` constant; the version methods live on the class.
+
+Inside `PaperlessLanceVectorStore`, add these methods after `stored_model_name`:
+
+```python
+@property
+def _schema_version_path(self) -> Path:
+    return Path(self._uri) / "schema_version.json"
+
+def stored_schema_version(self) -> int:
+    """Return the schema version recorded on disk, or CURRENT_SCHEMA_VERSION if missing.
+
+    Missing means either the table predates versioning or was just created and the
+    write hasn't happened yet — treat conservatively as already current.
+    """
+    try:
+        return int(json.loads(self._schema_version_path.read_text())["version"])
+    except (FileNotFoundError, KeyError, ValueError):
+        return CURRENT_SCHEMA_VERSION
+
+def _write_schema_version(self, version: int) -> None:
+    self._schema_version_path.parent.mkdir(parents=True, exist_ok=True)
+    self._schema_version_path.write_text(json.dumps({"version": version}))
+```
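
The fallback rule in the docstring above can be checked in isolation. This is a minimal sketch, assuming a standalone `read_version` function (a hypothetical stand-in for `stored_schema_version`, not part of the plan) and a temporary directory in place of the LanceDB URI:

```python
import json
import tempfile
from pathlib import Path

CURRENT_SCHEMA_VERSION = 1  # same value the plan assigns in vector_store.py


def read_version(path: Path) -> int:
    """Hypothetical standalone stand-in for stored_schema_version()."""
    try:
        return int(json.loads(path.read_text())["version"])
    except (FileNotFoundError, KeyError, ValueError):
        # json.JSONDecodeError subclasses ValueError, so corrupt files land here too.
        return CURRENT_SCHEMA_VERSION


with tempfile.TemporaryDirectory() as tmp:
    version_file = Path(tmp) / "schema_version.json"

    missing = read_version(version_file)  # file does not exist yet

    version_file.write_text(json.dumps({"version": 3}))
    stored = read_version(version_file)  # well-formed file

    version_file.write_text("not json")
    corrupt = read_version(version_file)  # unreadable content
```

Under these assumptions, a missing or unparseable file is reported as the current version, so only a well-formed file with a lower version number can ever produce pending migrations.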
+
+Modify `_ensure_table` to write the version after creating the table. Replace the current method body:
+
+```python
+def _ensure_table(self, rows: list[dict[str, Any]], dim: int) -> bool:
+    if self._table is not None:
+        return False
+    self._table = self._conn.create_table(
+        self._table_name,
+        rows,
+        schema=self._schema(dim, self._embed_model_name),
+    )
+    self._write_schema_version(CURRENT_SCHEMA_VERSION)
+    return True
+```
+
+Modify `drop_table` to also remove the version file:
+
+```python
+def drop_table(self) -> None:
+    if self.table_exists():
+        self._conn.drop_table(self._table_name)
+    self._table = None
+    self._schema_version_path.unlink(missing_ok=True)
+```
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestSchemaVersioning -v"
+```
+
+Expected: all 5 tests pass.
+
+- [ ] **Step 5: Verify no regressions**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
+```
+
+Expected: all existing tests still pass.
+
+- [ ] **Step 6: Lint**
+
+```bash
+ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+```
+
+Expected: no errors.
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+git commit -m "feat(ai): add schema version file tracking to LanceDB vector store"
+```
+
+---
+
+## Task 2: Migration dataclass and pending migration detection
+
+**Files:**
+
+- Modify: `src/paperless_ai/vector_store.py`
+- Test: `src/paperless_ai/tests/test_vector_store.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+Add a new class to `test_vector_store.py`:
+
+```python
+class TestMigrationRegistry:
+    @pytest.fixture
+    def uri(self, tmp_path: Path) -> str:
+        return str(tmp_path / "idx")
+
+    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
+        """Create a store with a table and then fake its on-disk version."""
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.add([_node("1-0", "1", "text", 0.1)])
+        store._write_schema_version(version)
+        return PaperlessLanceVectorStore(uri=uri)  # reopen to pick up written version
+
+    def test_pending_migrations_empty_at_current_version(self, uri: str) -> None:
+        from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+
+        store = self._store_at_version(uri, CURRENT_SCHEMA_VERSION)
+        assert store.pending_migrations() == []
+
+    def test_pending_migrations_returns_migrations_above_stored_version(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
+        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
+
+        store = self._store_at_version(uri, 1)
+        pending = store.pending_migrations()
+        assert pending == [m2, m3]
+
+    def test_pending_migrations_excludes_already_applied(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
+        m3 = Migration(version=3, description="reindex", requires_reembed=True, apply=lambda t: None)
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
+
+        store = self._store_at_version(uri, 2)
+        pending = store.pending_migrations()
+        assert pending == [m3]
+
+    def test_pending_migrations_empty_when_no_table(self, uri: str) -> None:
+        store = PaperlessLanceVectorStore(uri=uri)
+        assert store.pending_migrations() == []
+
+    def test_requires_reembed_migration_false_when_none_pending(self, uri: str) -> None:
+        store = self._store_at_version(uri, 1)
+        assert store.requires_reembed_migration() is False
+
+    def test_requires_reembed_migration_false_when_only_structural_pending(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: None)
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+
+        store = self._store_at_version(uri, 1)
+        assert store.requires_reembed_migration() is False
+
+    def test_requires_reembed_migration_true_when_reembed_migration_pending(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="reindex", requires_reembed=True, apply=lambda t: None)
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+
+        store = self._store_at_version(uri, 1)
+        assert store.requires_reembed_migration() is True
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
+```
+
+Expected: all 7 tests fail — `Migration`, `MIGRATIONS`, `pending_migrations`, `requires_reembed_migration` don't exist yet.
+
+- [ ] **Step 3: Add `Migration` dataclass and registry to `vector_store.py`**
+
+Add near the top of the file, after the existing imports:
+
+```python
+from dataclasses import dataclass, field
+from typing import Callable
+```
+
+After the `CURRENT_SCHEMA_VERSION` constant, add:
+
+```python
+@dataclass(frozen=True)
+class Migration:
+    version: int
+    description: str
+    requires_reembed: bool
+    apply: Callable[[Any], None] = field(compare=False, hash=False)
+```
+
+(`compare=False` excludes `apply` from `__eq__`, and `hash=False` excludes it from `__hash__`, so two migrations are equal when their `version`, `description`, and `requires_reembed` all match. This avoids lambda identity issues in tests and makes the API safe for callers that construct `Migration` instances inline.)
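
The effect of excluding `apply` from comparison can be seen in a short standalone sketch. The dataclass below is copied from the plan; the instances `a` and `b` and the result names are illustrative only:

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    apply: Callable[[Any], None] = field(compare=False, hash=False)


# Two instances built inline, each with its own lambda object.
a = Migration(2, "add col", False, apply=lambda t: None)
b = Migration(2, "add col", False, apply=lambda t: None)

same_callable = a.apply is b.apply  # the lambdas are distinct objects
equal_despite_lambdas = a == b  # apply is ignored by __eq__
same_hash = hash(a) == hash(b)  # apply is ignored by __hash__
later_version_differs = a != Migration(3, "add col", False, apply=lambda t: None)
```

Without `compare=False`, `a == b` would be false here because the two lambdas are different objects, and assertions comparing lists of migrations would depend on object identity.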
+
+Below the `Migration` class, add the registry:
+
+```python
+# Ordered list of schema migrations. Each entry upgrades the table to `version`.
+# Structural migrations (requires_reembed=False) are applied in-place via LanceDB's
+# add_columns/alter_columns/drop_columns APIs; no re-embedding is needed.
+# Migrations with requires_reembed=True cause a full rebuild on next index update,
+# exactly like a model-name change does today.
+#
+# To add a migration:
+# 1. Increment CURRENT_SCHEMA_VERSION.
+# 2. Append a Migration entry here with the new version number.
+# 3. For structural changes, call table.add_columns/alter_columns/drop_columns in apply().
+# 4. For embedding-invalidating changes, set requires_reembed=True; apply() can be a no-op.
+MIGRATIONS: list[Migration] = []
+```
+
+Inside `PaperlessLanceVectorStore`, add these two methods after the schema version helpers from Task 1:
+
+```python
+def pending_migrations(self) -> list[Migration]:
+    """Return migrations not yet applied to this table, in version order."""
+    if self._table is None:
+        return []
+    current = self.stored_schema_version()
+    return [m for m in MIGRATIONS if m.version > current]
+
+def requires_reembed_migration(self) -> bool:
+    """True when any pending migration requires a full re-embedding."""
+    return any(m.requires_reembed for m in self.pending_migrations())
+```
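
The filtering rule behind these two methods can be exercised without LanceDB. This is a minimal sketch, assuming a hypothetical `Step` stand-in for `Migration` (no `apply` callable) and a made-up three-entry registry:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """Hypothetical stand-in for Migration, without the apply callable."""

    version: int
    requires_reembed: bool


# Illustrative registry only; the plan's real MIGRATIONS list starts empty.
REGISTRY = [Step(2, False), Step(3, True), Step(4, False)]


def pending(stored_version: int) -> list[Step]:
    # Everything strictly above the stored version is still to be applied.
    return [s for s in REGISTRY if s.version > stored_version]


def needs_reembed(stored_version: int) -> bool:
    return any(s.requires_reembed for s in pending(stored_version))


from_v1 = [s.version for s in pending(1)]
from_v3 = [s.version for s in pending(3)]
```

A store at version 1 sees all three steps and needs a rebuild because of step 3, while a store already at version 3 sees only the structural step 4.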
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMigrationRegistry -v"
+```
+
+Expected: all 7 tests pass.
+
+- [ ] **Step 5: Lint**
+
+```bash
+ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+```
+
+- [ ] **Step 6: Commit**
+
+```bash
+git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+git commit -m "feat(ai): add Migration registry and pending migration detection"
+```
+
+---
+
+## Task 3: Apply structural migrations in-place
+
+**Files:**
+
+- Modify: `src/paperless_ai/vector_store.py`
+- Test: `src/paperless_ai/tests/test_vector_store.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+Add a new class to `test_vector_store.py`:
+
+```python
+class TestApplyStructuralMigrations:
+    @pytest.fixture
+    def uri(self, tmp_path: Path) -> str:
+        return str(tmp_path / "idx")
+
+    def _store_at_version(self, uri: str, version: int) -> PaperlessLanceVectorStore:
+        store = PaperlessLanceVectorStore(uri=uri)
+        store.add([_node("1-0", "1", "text", 0.1)])
+        store._write_schema_version(version)
+        return PaperlessLanceVectorStore(uri=uri)
+
+    def test_apply_structural_adds_column_via_lancedb(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        def _add_extra(table: Any) -> None:
+            table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})
+
+        m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+
+        store = self._store_at_version(uri, 1)
+        applied = store.apply_structural_migrations()
+
+        assert len(applied) == 1
+        assert applied[0] == m2
+        # Column actually present in the table schema.
+        reopened = PaperlessLanceVectorStore(uri=uri)
+        field_names = [f.name for f in reopened._table.schema]
+        assert "extra" in field_names
+
+    def test_apply_structural_updates_version_file(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+
+        store = self._store_at_version(uri, 1)
+        store.apply_structural_migrations()
+
+        assert store.stored_schema_version() == 2
+
+    def test_apply_structural_skips_reembed_migrations(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        from paperless_ai.vector_store import Migration
+
+        applied_versions: list[int] = []
+        m2 = Migration(version=2, description="structural", requires_reembed=False, apply=lambda t: applied_versions.append(2) or t.add_columns({"c": "CAST(NULL AS VARCHAR)"}))
+        m3 = Migration(version=3, description="reembed", requires_reembed=True, apply=lambda t: applied_versions.append(3))
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2, m3])
+
+        store = self._store_at_version(uri, 1)
+        applied = store.apply_structural_migrations()
+
+        assert [m.version for m in applied] == [2]
+        assert 3 not in applied_versions
+        # Version advances only to the last structural migration applied.
+        assert store.stored_schema_version() == 2
+
+    def test_apply_structural_noop_at_current_version(self, uri: str) -> None:
+        store = self._store_at_version(uri, 1)
+        applied = store.apply_structural_migrations()
+        assert applied == []
+
+    def test_apply_structural_noop_when_no_table(self, uri: str) -> None:
+        store = PaperlessLanceVectorStore(uri=uri)
+        applied = store.apply_structural_migrations()
+        assert applied == []
+
+    def test_apply_structural_refreshes_table_reference(
+        self, uri: str, mocker: pytest_mock.MockerFixture
+    ) -> None:
+        """After add_columns the in-memory table object must reflect the new schema."""
+        from paperless_ai.vector_store import Migration
+
+        m2 = Migration(version=2, description="add col", requires_reembed=False, apply=lambda t: t.add_columns({"extra": "CAST(NULL AS VARCHAR)"}))
+        mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+
+        store = self._store_at_version(uri, 1)
+        store.apply_structural_migrations()
+
+        # The store's own _table reference (not a re-open) must see the new column.
+        field_names = [f.name for f in store._table.schema]
+        assert "extra" in field_names
+```
+
+Add `from typing import Any` to the test file imports if not already present.
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
+```
+
+Expected: all 6 tests fail — `apply_structural_migrations` doesn't exist yet.
+
+- [ ] **Step 3: Implement `apply_structural_migrations` in `vector_store.py`**
+
+Add after `requires_reembed_migration` on the class:
+
+```python
+def apply_structural_migrations(self) -> list[Migration]:
+    """Apply all pending structural (non-reembed) migrations in version order.
+
+    Each applied migration's ``apply`` callable receives the live LanceDB table
+    object and should call ``add_columns``, ``alter_columns``, or ``drop_columns``
+    as needed.  After all structural migrations run, the version file is updated
+    to the highest version applied and the in-memory table reference is refreshed.
+
+    Migrations with ``requires_reembed=True`` are skipped — the caller is
+    responsible for detecting them via ``requires_reembed_migration()`` and
+    triggering a full rebuild.
+    """
+    if self._table is None:
+        return []
+    structural = [m for m in self.pending_migrations() if not m.requires_reembed]
+    if not structural:
+        return []
+    for migration in structural:
+        logger.info("Applying schema migration v%d: %s", migration.version, migration.description)
+        migration.apply(self._table)
+    # Refresh the in-memory table so subsequent operations see the new schema.
+    self._table = self._conn.open_table(self._table_name)
+    self._write_schema_version(structural[-1].version)
+    return structural
+```
+
+- [ ] **Step 4: Run tests to verify they pass**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestApplyStructuralMigrations -v"
+```
+
+Expected: all 6 tests pass.
+
+- [ ] **Step 5: Full test_vector_store regression check**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
+```
+
+Expected: all tests pass.
+
+- [ ] **Step 6: Lint**
+
+```bash
+ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+```
+
+- [ ] **Step 7: Commit**
+
+```bash
+git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py
+git commit -m "feat(ai): implement apply_structural_migrations for in-place schema changes"
+```
+
+---
+
+## Task 4: Wire migrations into `update_llm_index`
+
+**Files:**
+
+- Modify: `src/paperless_ai/indexing.py`
+- Test: `src/paperless_ai/tests/test_ai_indexing.py`
+
+- [ ] **Step 1: Write the failing tests**
+
+Add these two tests to `test_ai_indexing.py`, after the existing `test_update_llm_index_rebuilds_on_model_name_change` test:
+
+```python
+@pytest.mark.django_db
+def test_update_llm_index_applies_structural_migration_without_rebuild(
+    temp_llm_index_dir: Path,
+    real_document: Document,
+    mock_embed_model: FakeEmbedding,
+    mocker: pytest_mock.MockerFixture,
+) -> None:
+    """Structural migrations are applied in-place; no full rebuild (drop) occurs."""
+    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore
+
+    column_added: list[bool] = []
+
+    def _add_extra(table) -> None:
+        table.add_columns({"extra": "CAST(NULL AS VARCHAR)"})
+        column_added.append(True)
+
+    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
+    with patch("documents.models.Document.objects.all") as mock_all:
+        mock_queryset = MagicMock()
+        mock_queryset.exists.return_value = True
+        mock_queryset.__iter__.return_value = iter([real_document])
+        mock_all.return_value = mock_queryset
+        indexing.update_llm_index(rebuild=True)
+
+    # Simulate a new v2 structural migration being introduced after the initial index was built.
+    m2 = Migration(version=2, description="add extra col", requires_reembed=False, apply=_add_extra)
+    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
+    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")
+
+    with patch("documents.models.Document.objects.all") as mock_all:
+        mock_queryset = MagicMock()
+        mock_queryset.exists.return_value = True
+        mock_queryset.__iter__.return_value = iter([real_document])
+        mock_all.return_value = mock_queryset
+        indexing.update_llm_index(rebuild=False)
+
+    assert column_added, "Structural migration apply() was not called"
+    drop_spy.assert_not_called()
+
+
+@pytest.mark.django_db
+def test_update_llm_index_forces_rebuild_on_reembed_migration(
+    temp_llm_index_dir: Path,
+    real_document: Document,
+    mock_embed_model: FakeEmbedding,
+    mocker: pytest_mock.MockerFixture,
+) -> None:
+    """A pending reembed migration causes a full drop+rebuild on next update."""
+    from paperless_ai.vector_store import Migration, PaperlessLanceVectorStore
+
+    # Build the initial index at version 1 (the real CURRENT_SCHEMA_VERSION; no patches needed).
+    with patch("documents.models.Document.objects.all") as mock_all:
+        mock_queryset = MagicMock()
+        mock_queryset.exists.return_value = True
+        mock_queryset.__iter__.return_value = iter([real_document])
+        mock_all.return_value = mock_queryset
+        indexing.update_llm_index(rebuild=True)
+
+    # Simulate a reembed migration at v2 being introduced after the initial index was built.
+    m2 = Migration(version=2, description="requires reembed", requires_reembed=True, apply=lambda t: None)
+    mocker.patch("paperless_ai.vector_store.MIGRATIONS", [m2])
+    mocker.patch("paperless_ai.vector_store.CURRENT_SCHEMA_VERSION", 2)
+    drop_spy = mocker.spy(PaperlessLanceVectorStore, "drop_table")
+
+    with patch("documents.models.Document.objects.all") as mock_all:
+        mock_queryset = MagicMock()
+        mock_queryset.exists.return_value = True
+        mock_queryset.__iter__.return_value = iter([real_document])
+        mock_all.return_value = mock_queryset
+        indexing.update_llm_index(rebuild=False)
+
+    drop_spy.assert_called()
+```
+
+- [ ] **Step 2: Run tests to verify they fail**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
+```
+
+Expected: both tests fail because `update_llm_index` doesn't call migration methods yet.
+
+- [ ] **Step 3: Add migration check inside `update_llm_index` in `indexing.py`**
+
+Inside the `with write_store(embed_model_name=model_name) as store:` block in `update_llm_index`, insert the migration check immediately before the `if rebuild or not store.table_exists():` line:
+
+```python
+        if not rebuild and store.table_exists():
+            # Check for re-embed migrations first: applying structural migrations
+            # advances the version file and could hide an earlier re-embed step.
+            if store.requires_reembed_migration():
+                logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
+                rebuild = True
+            else:
+                store.apply_structural_migrations()
+```
+
+The relevant section of `update_llm_index` should now look like:
+
+```python
+    with write_store(embed_model_name=model_name) as store:
+        if not rebuild and store.table_exists():
+            # Check for re-embed migrations first: applying structural migrations
+            # advances the version file and could hide an earlier re-embed step.
+            if store.requires_reembed_migration():
+                logger.warning("Schema migration requires re-embedding; forcing LLM index rebuild.")
+                rebuild = True
+            else:
+                store.apply_structural_migrations()
+        if rebuild or not store.table_exists():
+            (settings.LLM_INDEX_DIR / "meta.json").unlink(missing_ok=True)
+            logger.info("Rebuilding LLM index.")
+            store.drop_table()
+            ...
+```
+
+- [ ] **Step 4: Run new tests to verify they pass**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_applies_structural_migration_without_rebuild src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_forces_rebuild_on_reembed_migration -v"
+```
+
+Expected: both tests pass.
+
+- [ ] **Step 5: Full indexing regression check**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
+```
+
+Expected: all existing tests still pass.
+
+- [ ] **Step 6: Full AI module test run**
+
+```bash
+bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/ -v"
+```
+
+Expected: all tests pass.
+
+- [ ] **Step 7: Lint**
+
+```bash
+ruff check src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
+ruff format src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
+```
+
+- [ ] **Step 8: Commit**
+
+```bash
+git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
+git commit -m "feat(ai): wire schema migrations into update_llm_index; structural changes avoid re-embed"
+```
+
+---
+
+## How to add a migration (reference for future developers)
+
+When a future schema change is needed:
+
+1. Increment `CURRENT_SCHEMA_VERSION` in `vector_store.py`.
+2. Append a `Migration` to `MIGRATIONS` with the new version number.
+3. If the change is **structural only** (add/rename/drop a column, no embedding content changed):
+   - Set `requires_reembed=False`
+   - In `apply`, call `table.add_columns({"col": "CAST(NULL AS string)"})`, `table.drop_columns(["col"])`, or `table.alter_columns({"path": "col", "rename": "new_name"})` as appropriate.
+4. If the change affects **what text gets embedded** (new fields in `build_llm_index_text`, chunk size change baked into schema, etc.):
+   - Set `requires_reembed=True`
+   - `apply` can be a no-op (`lambda t: None`) — the framework will trigger a full rebuild.
+5. Write tests for the migration in `test_vector_store.py` following the `TestApplyStructuralMigrations` patterns.
+
+Example structural migration adding a `language` column:
+
+```python
+CURRENT_SCHEMA_VERSION: int = 2
+
+MIGRATIONS: list[Migration] = [
+    Migration(
+        version=2,
+        description="Add language column for future locale-aware filtering",
+        requires_reembed=False,
+        apply=lambda table: table.add_columns({"language": "CAST(NULL AS string)"}),
+    ),
+]
+```
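
Steps 1 and 2 above imply an invariant: migration versions are unique and ascending, and the newest one matches `CURRENT_SCHEMA_VERSION`. This is a minimal sketch of a check for that invariant, assuming a hypothetical `registry_problems` helper that is not part of the plan and that treats version 1 as the baseline schema with no migration entry:

```python
def registry_problems(current_version: int, versions: list[int]) -> list[str]:
    """Return human-readable problems with a migration registry (hypothetical helper)."""
    problems: list[str] = []
    if versions != sorted(set(versions)):
        problems.append("migration versions must be unique and ascending")
    # Assumption: version 1 is the baseline schema and has no Migration entry.
    newest = max(versions, default=1)
    if newest != current_version:
        problems.append(
            f"CURRENT_SCHEMA_VERSION is {current_version} "
            f"but the newest migration is v{newest}"
        )
    return problems


ok = registry_problems(2, [2])
forgot_bump = registry_problems(1, [2])
out_of_order = registry_problems(3, [3, 2])
```

A test could call such a helper with `CURRENT_SCHEMA_VERSION` and `[m.version for m in MIGRATIONS]` to catch a forgotten version bump before it reaches an index.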
@@ -0,0 +1,446 @@
+# Node Metadata Enrichment Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Move `filename`, `storage_path`, and `archive_serial_number` from the LanceDB embedding text into `node.metadata`, and register a schema migration that triggers an automatic index rebuild on upgrade.
+
+**Architecture:** Three small, independent changes to two source files, tested first. The migration is a no-op `apply` (the rebuild regenerates all nodes with correct metadata). All three tests go red first, then each implementation makes them green.
+
+**Tech Stack:** pytest, pytest-django, pytest-mock, factory_boy, llama_index `MetadataMode`, `feature-lancedb-schema-migrate` branch (must be the base branch for this work).
+
+**Branch base:** `feature-lancedb-schema-migrate`
+
+---
+
+### Task 1: Fail — embedding text no longer contains the three fields
+
+**Files:**
+
+- Modify: `src/paperless_ai/tests/test_embedding.py`
+
+- [ ] **Step 1: Update `mock_document` fixture to set an explicit `storage_path`**
+
+  The fixture currently doesn't set `storage_path`, so the existing code path (`doc.storage_path.name if doc.storage_path else ''`) would call `.name` on a `MagicMock`. Give it an explicit value so assertions are unambiguous.
+
+  Add these two lines to the `mock_document` fixture after `doc.archive_serial_number = "12345"`:
+
+  ```python
+  doc.storage_path = MagicMock()
+  doc.storage_path.name = "Finance/Bills"
+  ```
+
+- [ ] **Step 2: Update `test_build_llm_index_text` — flip and add assertions**
+
+  The existing test asserts that the filename line IS in the result. Flip that assertion to NOT, and add matching assertions for the storage path and archive serial number:
+
+  ```python
+  # was: assert "Filename: test_file.pdf" in result
+  assert "Filename: test_file.pdf" not in result
+  assert "Storage Path: Finance/Bills" not in result
+  assert "Archive Serial Number: 12345" not in result
+  ```
+
+  The assertions for `Notes`, `Content`, and `Custom Field` lines are unchanged — leave them as-is.
+
+- [ ] **Step 3: Run the test to confirm it fails**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
+  ```
+
+  Expected: `FAILED` — `AssertionError: assert 'Filename: test_file.pdf' not in '...'`
+
+---
+
+### Task 2: Pass — remove the three fields from `build_llm_index_text`
+
+**Files:**
+
+- Modify: `src/paperless_ai/embedding.py`
+
+- [ ] **Step 1: Remove the three lines and the TODO comment**
+
+  The current `build_llm_index_text` spans lines 114–133. Replace the whole function with:
+
+  ```python
+  def build_llm_index_text(doc: Document) -> str:
+      lines = [
+          f"Notes: {','.join([str(c.note) for c in Note.objects.filter(document=doc)])}",
+      ]
+
+      for instance in doc.custom_fields.all():
+          lines.append(f"Custom Field - {instance.field.name}: {instance}")
+
+      lines.append("\nContent:\n")
+      lines.append(doc.content or "")
+
+      return _normalize_llm_index_text("\n".join(lines))
+  ```
+
+- [ ] **Step 2: Run the test to confirm it passes**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py::test_build_llm_index_text -v"
+  ```
+
+  Expected: `PASSED`
+
+- [ ] **Step 3: Run the full embedding test module to catch regressions**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_embedding.py -v"
+  ```
+
+  Expected: all green.
+
+- [ ] **Step 4: Commit**
+
+  ```bash
+  git add src/paperless_ai/embedding.py src/paperless_ai/tests/test_embedding.py
+  git commit -m "refactor(ai): remove filename/storage_path/asn from embedding text"
+  ```
+
+---
+
+### Task 3: Fail — `build_document_node` exposes the three fields in metadata
+
+**Files:**
+
+- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
+
+- [ ] **Step 1: Extend `test_build_document_node_structured_fields_in_metadata`**
+
+  This test already checks for `title`, `tags`, etc. Add the three new keys. The `real_document` fixture creates a document with no storage path set, so `storage_path` will be `None` — the key must still be present.
+
+  Replace the existing test body:
+
+  ```python
+  @pytest.mark.django_db
+  def test_build_document_node_structured_fields_in_metadata(
+      real_document: Document,
+  ) -> None:
+      """Structured fields must be in node.metadata so the LLM receives them via metadata prepend."""
+      nodes = indexing.build_document_node(real_document)
+      assert len(nodes) > 0
+      for node in nodes:
+          assert "title" in node.metadata
+          assert "tags" in node.metadata
+          assert "correspondent" in node.metadata
+          assert "document_type" in node.metadata
+          assert "created" in node.metadata
+          assert "added" in node.metadata
+          assert "modified" in node.metadata
+          assert "filename" in node.metadata
+          assert "storage_path" in node.metadata        # None is fine; key must exist
+          assert "archive_serial_number" in node.metadata
+  ```
+
+- [ ] **Step 2: Add a test that storage_path carries the name when set**
+
+  Add a new test function after `test_build_document_node_structured_fields_in_metadata`:
+
+  ```python
+  @pytest.mark.django_db
+  def test_build_document_node_storage_path_name_in_metadata() -> None:
+      """storage_path metadata value is the StoragePath name, not None, when set."""
+      from documents.tests.factories import DocumentFactory, StoragePathFactory
+
+      sp = StoragePathFactory(name="Finance/Bills")
+      doc = DocumentFactory(storage_path=sp)
+
+      nodes = indexing.build_document_node(doc)
+
+      assert len(nodes) > 0
+      for node in nodes:
+          assert node.metadata["storage_path"] == "Finance/Bills"
+  ```
+
+- [ ] **Step 3: Add a test that all three new fields are in `excluded_embed_metadata_keys`**
+
+  Add after the previous test:
+
+  ```python
+  @pytest.mark.django_db
+  def test_build_document_node_new_fields_excluded_from_embedding(
+      real_document: Document,
+  ) -> None:
+      """filename, storage_path, and archive_serial_number must not appear in embedding text."""
+      from llama_index.core.schema import MetadataMode
+
+      nodes = indexing.build_document_node(real_document)
+      assert len(nodes) > 0
+      for node in nodes:
+          assert "filename" in node.excluded_embed_metadata_keys
+          assert "storage_path" in node.excluded_embed_metadata_keys
+          assert "archive_serial_number" in node.excluded_embed_metadata_keys
+          embed_text = node.get_content(metadata_mode=MetadataMode.EMBED)
+          assert "filename" not in embed_text
+          assert "storage_path" not in embed_text
+          assert "archive_serial_number" not in embed_text
+  ```
+
+- [ ] **Step 4: Run the new tests to confirm they fail**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
+  ```
+
+  Expected: all `FAILED` — keys not yet in `node.metadata`.
+
+---
+
+### Task 4: Pass — add the three fields to `build_document_node`
+
+**Files:**
+
+- Modify: `src/paperless_ai/indexing.py`
+
+- [ ] **Step 1: Update the `metadata` dict in `build_document_node`**
+
+  Current metadata dict starts at line 106. Replace it:
+
+  ```python
+  metadata = {
+      "document_id": str(document.id),
+      "title": document.title,
+      "filename": document.filename or "",
+      "storage_path": document.storage_path.name if document.storage_path else None,
+      "archive_serial_number": document.archive_serial_number,
+      "tags": [t.name for t in document.tags.all()],
+      "correspondent": document.correspondent.name
+      if document.correspondent
+      else None,
+      "document_type": document.document_type.name
+      if document.document_type
+      else None,
+      "created": document.created.isoformat() if document.created else None,
+      "added": document.added.isoformat() if document.added else None,
+      "modified": document.modified.isoformat(),
+  }
+  ```
+
+- [ ] **Step 2: Verify `excluded_embed_metadata_keys` (no change expected)**
+
+  The `LlamaDocument(...)` call currently has:
+
+  ```python
+  excluded_embed_metadata_keys=list(metadata.keys()),
+  ```
+
+  This already excludes all keys, so no change needed here — the new keys are automatically included since they're in the dict. Verify `excluded_llm_metadata_keys` still only excludes `"document_id"`:
+
+  ```python
+  excluded_llm_metadata_keys=["document_id"],
+  ```
+
+  No change needed.
+
+- [ ] **Step 3: Run the failing tests to confirm they pass**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_structured_fields_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_storage_path_name_in_metadata src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_new_fields_excluded_from_embedding -v"
+  ```
+
+  Expected: all `PASSED`.
+
+- [ ] **Step 4: Run the full indexing test module**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
+  ```
+
+  Expected: all green.
+
+- [ ] **Step 5: Commit**
+
+  ```bash
+  git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py
+  git commit -m "feat(ai): add filename/storage_path/asn to node metadata"
+  ```
+
+---
+
+### Task 5: Fail — migration v2 is registered
+
+**Files:**
+
+- Modify: `src/paperless_ai/tests/test_vector_store.py`
+
+These tests use the real (non-mocked) `MIGRATIONS` list, so they go red until the migration is registered in Task 6.
+
+- [ ] **Step 1: Add a `TestMetadataEnrichmentMigration` class**
+
+  Add this class near the end of `test_vector_store.py`, before the final `TestApplyStructuralMigrations`:
+
+  ```python
+  class TestMetadataEnrichmentMigration:
+      def test_current_schema_version_is_2(self) -> None:
+          from paperless_ai.vector_store import CURRENT_SCHEMA_VERSION
+          assert CURRENT_SCHEMA_VERSION == 2
+
+      def test_migration_v2_registered(self) -> None:
+          from paperless_ai.vector_store import MIGRATIONS
+          assert len(MIGRATIONS) == 1
+          assert MIGRATIONS[0].version == 2
+          assert MIGRATIONS[0].requires_reembed is True
+
+      def test_store_at_v1_requires_reembed(self, uri: str) -> None:
+          store = _store_at_version(uri, 1)
+          assert store.requires_reembed_migration() is True
+
+      def test_store_at_v2_no_pending_migrations(self, uri: str) -> None:
+          store = _store_at_version(uri, 2)
+          assert store.pending_migrations() == []
+          assert store.requires_reembed_migration() is False
+  ```
+
+- [ ] **Step 2: Run the tests to confirm they fail**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
+  ```
+
+  Expected: all `FAILED` — `CURRENT_SCHEMA_VERSION` is still 1 and `MIGRATIONS` is still empty.
+
+---
+
+### Task 6: Pass — register migration v2 in `vector_store.py`
+
+**Files:**
+
+- Modify: `src/paperless_ai/vector_store.py`
+
+- [ ] **Step 1: Add the migration and bump the version constant**
+
+  On the `feature-lancedb-schema-migrate` branch, `vector_store.py` has:
+
+  ```python
+  CURRENT_SCHEMA_VERSION: Final[int] = 1
+  ...
+  MIGRATIONS: list[Migration] = []
+  ```
+
+  Change both:
+
+  ```python
+  CURRENT_SCHEMA_VERSION: Final[int] = 2
+
+  MIGRATIONS: list[Migration] = [
+      Migration(
+          version=2,
+          description="move filename/storage_path/asn from embedding text to metadata; rebuild required",
+          requires_reembed=True,
+          apply=lambda table: None,
+      ),
+  ]
+  ```
+
+- [ ] **Step 2: Run the migration tests to confirm they pass**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py::TestMetadataEnrichmentMigration -v"
+  ```
+
+  Expected: all `PASSED`.
+
+- [ ] **Step 3: Run the full vector store test module**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_vector_store.py -v"
+  ```
+
+  Expected: all green. In particular, `TestSchemaVersioning::test_stored_schema_version_persists_after_reopen` and the `TestMigrationRegistry` tests should still pass — they use `CURRENT_SCHEMA_VERSION` as the baseline.
+
+---
+
+### Task 7: Integration — `update_llm_index` rebuilds when schema version is stale
+
+**Files:**
+
+- Modify: `src/paperless_ai/tests/test_ai_indexing.py`
+
+- [ ] **Step 1: Write the failing integration test**
+
+  Add this test near `test_update_llm_index_rebuilds_on_model_name_change`:
+
+  ```python
+  @pytest.mark.django_db
+  def test_update_llm_index_rebuilds_on_pending_reembed_migration(
+      temp_llm_index_dir: Path,
+      real_document: Document,
+      mock_embed_model: FakeEmbedding,
+  ) -> None:
+      """A stale schema version (v1) must trigger a full rebuild on the next index run."""
+      from paperless_ai.vector_store import PaperlessLanceVectorStore
+
+      # Build an initial index and then rewind the schema version to 1 to simulate
+      # an index created before migration v2 was registered.
+      indexing.update_llm_index(rebuild=True)
+      store = indexing.get_vector_store()
+      store._write_schema_version(1)
+
+      # An incremental run (rebuild=False) must detect the stale version and rebuild.
+      with patch("documents.models.Document.objects.all") as mock_all:
+          mock_queryset = MagicMock()
+          mock_queryset.exists.return_value = True
+          mock_queryset.__iter__.return_value = iter([real_document])
+          mock_all.return_value = mock_queryset
+          indexing.update_llm_index(rebuild=False)
+
+      # After rebuild the schema version must be current.
+      reopened = PaperlessLanceVectorStore(uri=str(temp_llm_index_dir))
+      assert reopened.stored_schema_version() == 2
+  ```
+
+- [ ] **Step 2: Run the test to confirm it fails**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
+  ```
+
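+  Expected: `FAILED` only if this test is written before Task 6 lands, because the schema version then stays at 1 with no migration v2 registered. When the tasks are executed in order, migration v2 is already registered and the test passes at this step.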
+
+  _(If it passes already because `update_llm_index` detects a different condition, verify the assertion is actually exercising the migration path and not the model-name path.)_
+
+- [ ] **Step 3: Run the test again now that migration v2 is registered (Task 6)**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py::test_update_llm_index_rebuilds_on_pending_reembed_migration -v"
+  ```
+
+  Expected: `PASSED`.
+
+- [ ] **Step 4: Run the full indexing test module**
+
+  ```
+  bash /c/Users/tholmes/Documents/Coding/paperless/vmtest.sh "src/paperless_ai/tests/test_ai_indexing.py -v"
+  ```
+
+  Expected: all green.
+
+- [ ] **Step 5: Final commit**
+
+  ```bash
+  git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py src/paperless_ai/tests/test_ai_indexing.py
+  git commit -m "feat(ai): register schema migration v2; triggers rebuild for metadata enrichment"
+  ```
+
+---
+
+## Self-review checklist
+
+**Spec coverage:**
+
+- ✅ `build_llm_index_text` — three lines removed (Tasks 1–2)
+- ✅ `build_document_node` — three fields added to metadata + excluded_embed_metadata_keys (Tasks 3–4)
+- ✅ Migration v2 registered with `requires_reembed=True` and no-op apply (Tasks 5–6)
+- ✅ `update_llm_index` triggers rebuild on stale schema (Task 7)
+- ✅ Tests: `test_embedding.py`, `test_ai_indexing.py`, `test_vector_store.py`
+
+**Placeholder scan:** None found. Every step has exact code or exact commands.
+
+**Type consistency:**
+
+- `metadata` dict key names (`"filename"`, `"storage_path"`, `"archive_serial_number"`) used consistently across Tasks 1–4.
+- `CURRENT_SCHEMA_VERSION = 2` and `MIGRATIONS[0].version == 2` are consistent across Tasks 5–6.
+- The `_store_at_version` helper (used in Task 5) and the `_node` helper are defined in the existing `test_vector_store.py` on the `feature-lancedb-schema-migrate` branch.
@@ -0,0 +1,115 @@
+# LanceDB Node Metadata Enrichment
+
+**Status:** Design
+**Date:** 2026-06-09
+**Branch target:** `dev`
+**Prerequisite for:** AI taxonomy hints (`2026-05-20-ai-taxonomy-hints-design.md`)
+**Depends on:** `feature-lancedb-schema-migrate`
+
+## Problem
+
+`build_llm_index_text` currently includes three short structured values in the embedding text:
+
+```python
+lines = [
+    f"Filename: {doc.filename}",
+    f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
+    f"Archive Serial Number: {doc.archive_serial_number or ''}",
+    ...
+]
+```
+
+These don't belong in the embedding. The embedding should capture semantic content — the meaning of the document — not structured identifiers. Including them means vectors are partly "polluted" with filing metadata, making similarity search less accurate. The existing TODO in `embedding.py:115` explicitly calls this out.
+
+The right home for structured values is `node.metadata` (excluded from the embedding, but surfaced to the LLM when nodes are retrieved as context). `title`, `tags`, `correspondent`, and `document_type` already follow this pattern.
+
+Notes and custom fields stay in the embedding text — Notes is long free text, custom fields are dynamic and their semantic content belongs in the vector.
+
+## Changes
+
+### `paperless_ai/embedding.py` — `build_llm_index_text`
+
+Remove the three lines and the TODO comment:
+
+```python
+# remove:
+f"Filename: {doc.filename}",
+f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
+f"Archive Serial Number: {doc.archive_serial_number or ''}",
+```
+
+`Notes` and `Custom Fields` lines remain.
+
+### `paperless_ai/indexing.py` — `build_document_node`
+
+Add the three fields to the metadata dict:
+
+```python
+metadata = {
+    "document_id": str(document.id),
+    "title": document.title,
+    "filename": document.filename or "",
+    "storage_path": document.storage_path.name if document.storage_path else None,
+    "archive_serial_number": document.archive_serial_number,
+    "tags": [t.name for t in document.tags.all()],
+    "correspondent": document.correspondent.name if document.correspondent else None,
+    "document_type": document.document_type.name if document.document_type else None,
+    "created": document.created.isoformat() if document.created else None,
+    "added": document.added.isoformat() if document.added else None,
+    "modified": document.modified.isoformat(),
+}
+```
+
+All three new keys must also appear in `excluded_embed_metadata_keys` (consistent with all existing keys — none of the metadata is included in the embedding text).
+
+### `paperless_ai/vector_store.py` — schema migration
+
+Register migration version 2 on the `feature-lancedb-schema-migrate` framework. The embedding text changes, so all existing vectors are stale — a full rebuild is required. The migration's `apply` is a no-op; the rebuild handles regenerating all nodes with the correct metadata.
+
+```python
+MIGRATIONS: list[Migration] = [
+    Migration(
+        version=2,
+        description="move filename/storage_path/asn from embedding text to metadata",
+        requires_reembed=True,
+        apply=lambda table: None,
+    ),
+]
+CURRENT_SCHEMA_VERSION: Final[int] = 2
+```
+
+On next `update_llm_index` run, `requires_reembed_migration()` returns `True`, triggering a full drop-and-rebuild. All new nodes carry the three metadata fields. No manual intervention required.
+
+## Impact
+
+- Similarity search quality improves slightly — vectors are more purely semantic.
+- The LLM receives `filename`, `storage_path`, and `archive_serial_number` as structured metadata alongside retrieved chunks, rather than embedded in the chunk text. Same information, cleaner separation.
+- One forced index rebuild on upgrade (beta: acceptable).
+- `node.metadata["storage_path"]`, `node.metadata["filename"]`, `node.metadata["archive_serial_number"]` are available on all retrieved nodes after rebuild — unblocks the taxonomy hints feature.
+
+## Testing
+
+All tests use pytest style — grouped under classes, `@pytest.mark.django_db` on the class, `pytest-mock`'s `mocker` fixture, every fixture and test signature type-annotated. Format with `ruff` directly.
+
+### `paperless_ai/tests/test_embedding.py` (modify)
+
+- `class TestBuildLlmIndexText:`
+  - Assert `"Filename:"` is **not** in the output.
+  - Assert `"Storage Path:"` is **not** in the output.
+  - Assert `"Archive Serial Number:"` is **not** in the output.
+  - Assert Notes and Custom Fields lines are still present (regression guard).
+
+### `paperless_ai/tests/test_ai_indexing.py` (modify)
+
+- `class TestBuildDocumentNode:`
+  - `filename` is in `node.metadata` and in `excluded_embed_metadata_keys`.
+  - `storage_path` is in `node.metadata` (name string) and in `excluded_embed_metadata_keys`; `None` when document has no storage path.
+  - `archive_serial_number` is in `node.metadata` and in `excluded_embed_metadata_keys`; `None` when unset.
+  - None of the three appear in the embedding text produced for the node.
+
+### `paperless_ai/tests/test_vector_store.py` (modify)
+
+- `class TestSchemaMigrations:`
+  - `pending_migrations()` returns the v2 migration when stored version is 1.
+  - `requires_reembed_migration()` returns `True` when stored version is 1.
+  - `apply_structural_migrations()` stops at the v2 migration (re-embed entries are not applied in place).
@@ -0,0 +1,138 @@
+# LLM Index Schema Migrations (second spec)
+
+Date: 2026-06-10
+Depends on: `docs/superpowers/specs/2026-06-10-sqlite-vec-vector-store-design.md` and its implementation plan (`docs/superpowers/plans/2026-06-10-sqlite-vec-transition.md`). This spec layers on top of the completed sqlite-vec transition; do not start it before that branch lands.
+Supersedes: PR #12968 (in-place LanceDB migrations). The machinery design there is carried over nearly verbatim; only the storage backend specifics change. #12968 should be closed with a pointer here once this ships.
+
+Scope update (user decision, 2026-06-10): the `embedding.py:115` metadata restructure originally drafted as Part 2 of this spec was folded into the transition plan instead (its Task 5), because the transition forces a full rebuild anyway, so the embedded-text change rides along with no extra re-embed cost. This spec is now machinery-only: it ships with an EMPTY migration registry, ready for whatever schema change comes next. Part 2 below is retained as the worked example of how a re-embed migration would be registered, since the next one will not have a free rebuild to piggyback on.
+
+## Part 1: Schema migration machinery (ported from PR #12968)
+
+### What carries over unchanged
+
+The PR's design survives the store swap intact and is adopted as-is:
+
+- `Migration` frozen dataclass: `version: int`, `description: str`, `requires_reembed: bool`, `apply: Callable` (compare/hash-excluded field).
+- `MIGRATIONS: list[Migration]` ordered registry + `CURRENT_SCHEMA_VERSION: Final[int]` in `vector_store.py`. To add a migration: bump the constant, append an entry.
+- Store surface: `stored_schema_version() -> int` (0 when unrecorded, so pre-versioning tables treat every migration as pending), `pending_migrations()`, `requires_reembed_migration()`, `apply_structural_migrations() -> list[Migration]`.
+- The stop-at-first-reembed-boundary rule in `apply_structural_migrations()`: structural migrations are applied in version order only up to the first pending `requires_reembed=True` entry, so the version counter can never jump past a re-embed boundary and silently skip the rebuild. (This was the subtle correctness insight of #12968; preserve the comment.)
+- The `update_llm_index()` hook, verbatim from the PR:
+
+```python
+    with write_store(embed_model_name=model_name) as store:
+        if not rebuild and store.table_exists():
+            store.apply_structural_migrations()
+            if store.requires_reembed_migration():
+                logger.warning(
+                    "Schema migration requires re-embedding; forcing LLM index rebuild.",
+                )
+                rebuild = True
+```
+
+- Test approach from the PR: mock `MIGRATIONS`/`CURRENT_SCHEMA_VERSION` with `mocker.patch`, spy on `drop_table` to distinguish in-place from rebuild, one test per path (structural applied without rebuild; pending re-embed forces rebuild).
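
The stop-at-first-reembed-boundary rule described above can be sketched in isolation. This is a minimal illustration and not the PR's code: `apply_structural` is a hypothetical free function standing in for the store method, the `target` argument stands in for whatever `apply` receives, and bootstrap stamping of unversioned tables is left out.

```python
from collections.abc import Callable
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    # Excluded from equality and hashing, as the spec describes.
    apply: Callable[[object], None] = field(compare=False, hash=False)


def apply_structural(
    stored_version: int,
    migrations: list[Migration],
    target: object,
) -> tuple[int, list[Migration]]:
    """Apply pending structural migrations in order, stopping at the first re-embed entry."""
    applied: list[Migration] = []
    for migration in sorted(migrations, key=lambda m: m.version):
        if migration.version <= stored_version:
            continue  # already applied
        if migration.requires_reembed:
            break  # boundary: the caller must rebuild before anything later runs
        migration.apply(target)
        stored_version = migration.version
        applied.append(migration)
    return stored_version, applied


# Mixed registry: structural v2, re-embed v3, structural v4.
calls: list[int] = []
registry = [
    Migration(2, "structural", False, lambda t: calls.append(2)),
    Migration(3, "re-embed", True, lambda t: None),
    Migration(4, "structural", False, lambda t: calls.append(4)),
]
version, applied = apply_structural(1, registry, target=None)
```

With this registry only v2 runs and the returned version is 2, so v4 cannot be applied until the rebuild for v3 has happened.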
+
+### What changes for sqlite-vec
+
+**1. Version storage: `index_meta['schema_version']` instead of `schema_version.json`.**
+The Lance store needed a sidecar JSON file because Lance had no convenient mutable metadata. The sqlite-vec store already has the `index_meta` key/value table, which is transactional with the data itself (a migration and its version bump commit atomically, which the file never could). Concretely:
+
+- `_create_table(dim)` additionally writes `schema_version = str(CURRENT_SCHEMA_VERSION)` (fresh tables are always current).
+- `stored_schema_version()` reads the meta key, returns 0 on absence/garbage.
+- `drop_table()` already does `DELETE FROM index_meta`, which clears the version with it. No sidecar file, no unlink bookkeeping.
+- `apply_structural_migrations()` writes the new version inside the same transaction as the last applied migration.
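
A rough sketch of the version bookkeeping listed above, using only the standard-library `sqlite3` module. The function names mirror the spec, but the bodies, the free-function form, and the `write_schema_version` helper are assumptions made for illustration.

```python
import sqlite3

CURRENT_SCHEMA_VERSION = 1  # assumed value for this sketch


def stored_schema_version(conn: sqlite3.Connection) -> int:
    """Return the recorded version, or 0 when the key is missing or unparsable."""
    row = conn.execute(
        "SELECT value FROM index_meta WHERE key = 'schema_version'",
    ).fetchone()
    if row is None:
        return 0
    try:
        return int(row[0])
    except (TypeError, ValueError):
        return 0


def write_schema_version(conn: sqlite3.Connection, version: int) -> None:
    """Record the version; the caller owns the surrounding transaction."""
    conn.execute(
        "INSERT OR REPLACE INTO index_meta (key, value) VALUES ('schema_version', ?)",
        (str(version),),
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT)")

unrecorded = stored_schema_version(conn)  # no key yet
with conn:  # commits on success, rolls back on an exception
    write_schema_version(conn, CURRENT_SCHEMA_VERSION)
recorded = stored_schema_version(conn)
with conn:
    conn.execute("DELETE FROM index_meta")  # what drop_table() is described as doing
after_drop = stored_schema_version(conn)
```

Because the version lives in the same database as the rows, a migration's statements and the version write can share one `with conn:` block and commit or roll back together.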
+
+**2. `apply` receives the store, not a table handle.**
+Lance migrations got the raw table for `add_columns`/`alter_columns`. vec0 virtual tables do not support arbitrary `ALTER TABLE`, so structural migrations are SQL against the store's connection. Signature: `apply: Callable[[PaperlessSqliteVecVectorStore], None]`. The store exposes what migrations need: `.client` (connection), `._table_name`, `.vector_dim()`, and the rebuild helper below.
+
+**3. Structural migrations are create+copy+rename, sharing the compact() machinery.**
+The sqlite-vec `compact()` already implements the only structural mutation vec0 supports: build a new table, `INSERT INTO ... SELECT` (vectors copied bit-for-bit, no re-embedding), drop old, rename. Factor it into a shared helper on the store:
+
+```python
+def rebuild_table(
+    self,
+    *,
+    create_sql: str | None = None,
+    copy_select: str | None = None,
+) -> None:
+    """Copy live rows into a freshly created table and swap it in.
+
+    Defaults reproduce the current schema (compaction). Structural
+    migrations pass a modified CREATE statement and a matching SELECT
+    (e.g. adding a column with a literal default). Runs in one
+    transaction; VACUUM afterwards.
+    """
+```
+
+`compact()` becomes a thin caller (threshold check + `rebuild_table()`), and a structural migration like "add a `+page_count` aux column" is:
+
+```python
+Migration(
+    version=2,
+    description="add page_count auxiliary column",
+    requires_reembed=False,
+    apply=lambda store: store.rebuild_table(
+        create_sql=...,        # CREATE VIRTUAL TABLE ... with the new column
+        copy_select="SELECT id, document_id, modified, node_content, embedding, '' FROM {old}",
+    ),
+)
+```
+
+A pleasant consequence: every structural migration is also a compaction (the copy drops dead rows), and the file-format risk surface is one helper with one test suite instead of two code paths.
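
The create+copy+rename flow can be illustrated with an ordinary SQLite table standing in for the vec0 virtual table. This is a sketch under stated assumptions: the free-function signature, the `{new}`/`{old}` placeholders, and the `_new` suffix are illustrative choices, the trailing VACUUM is omitted, and whether vec0 tables accept the same rename statement is not checked here.

```python
import sqlite3


def rebuild_table(
    conn: sqlite3.Connection,
    *,
    table: str,
    create_sql: str,
    copy_select: str,
) -> None:
    """Create `<table>_new`, copy rows from the old table, then swap the names."""
    new = f"{table}_new"
    with conn:  # commit on success, roll back on error
        conn.execute("BEGIN")  # explicit, so the DDL joins the same transaction
        conn.execute(create_sql.format(new=new))
        conn.execute(f"INSERT INTO {new} {copy_select.format(old=table)}")
        conn.execute(f"DROP TABLE {table}")
        conn.execute(f"ALTER TABLE {new} RENAME TO {table}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, document_id TEXT, node_content TEXT)")
conn.executemany(
    "INSERT INTO nodes VALUES (?, ?, ?)",
    [("a", "1", "{}"), ("b", "2", "{}")],
)
conn.commit()

# Structural change: add a page_count column filled with a literal default.
rebuild_table(
    conn,
    table="nodes",
    create_sql=(
        "CREATE TABLE {new} "
        "(id TEXT PRIMARY KEY, document_id TEXT, node_content TEXT, page_count TEXT)"
    ),
    copy_select="SELECT id, document_id, node_content, '' FROM {old}",
)
rows = conn.execute("SELECT id, page_count FROM nodes ORDER BY id").fetchall()
```

Existing rows survive the swap unchanged apart from the new column, which is the property that lets a structural migration avoid re-embedding.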
+
+**4. Bootstrap version for the sqlite-vec store is 1.**
+The transition plan ships the new store without machinery; tables it creates carry no `schema_version` key and therefore read as 0. This release lands with `CURRENT_SCHEMA_VERSION = 1` and `MIGRATIONS = []`, so the bootstrap is unconditionally safe: a 0-version table has no pending migrations and `apply_structural_migrations()` simply stamps it to 1. (The metadata restructure having moved into the transition itself is what makes this clean; the registry's first real entry will be v2, written against tables that are all stamped.)
+
+## Part 2 (worked example, IMPLEMENTED IN THE TRANSITION): the metadata TODO as a re-embed migration
+
+This section was implemented as Task 5 of the transition plan and ships with the store swap, not with this spec. It is kept as the reference example of how to register the next re-embed migration.
+
+### The change
+
+`build_llm_index_text()` currently embeds three short structured values in the body text:
+
+```python
+        f"Filename: {doc.filename}",
+        f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
+        f"Archive Serial Number: {doc.archive_serial_number or ''}",
+```
+
+Per the TODO, move them to `node.metadata` (excluded from embeddings, visible to the LLM via llama-index's metadata prepend), the same treatment title/tags/correspondent/document_type got in PR #12944. Notes and Custom Fields stay in the body (long free text / dynamic count, as the TODO says).
+
+1. `embedding.py build_llm_index_text()`: delete the three lines above (the `lines` list keeps Notes, Custom Fields, and Content). Update the TODO comment to describe only what remains intentional (Notes/Custom Fields stay embedded), or delete it.
+2. `indexing.py build_document_node()` metadata dict gains:
+
+```python
+        "filename": document.filename,
+        "storage_path": document.storage_path.name if document.storage_path else None,
+        "archive_serial_number": document.archive_serial_number,
+```
+
+(`None`/int values are fine here: this dict lives in the node-content JSON, not in vec0 metadata columns; only `document_id`/`modified` are columns with the NULL restriction. Matches the existing convention of `correspondent: None`.)
+
+3. `excluded_embed_metadata_keys=list(metadata.keys())` already covers the new keys; `excluded_llm_metadata_keys` stays `["document_id"]` so the LLM sees the new fields.
+
+### Why this class of change needs a migration
+
+Removing the three lines changes the embedded text of every document, so stored vectors no longer match what the current code would embed. Incremental updates only re-embed documents whose `modified` changed, so without a forced rebuild the index would be a mixed old/new-text population indefinitely. This particular change escaped that fate only because the transition's forced rebuild covers it. The next embedded-text change will not have that luxury and gets registered like this:
+
+```python
+CURRENT_SCHEMA_VERSION: Final[int] = 2
+
+MIGRATIONS: list[Migration] = [
+    Migration(
+        version=2,
+        description="<what changed about the embedded text>",
+        requires_reembed=True,
+        apply=lambda store: None,
+    ),
+]
+```
+
+On the first `update_llm_index` after upgrade, the hook sees the pending re-embed migration, logs, and rebuilds.
+
+### Test plan
+
+Machinery only (the metadata change is tested in the transition plan's Task 5). Port of the #12968 tests, dedicated file `test_vector_store_migrations.py`: structural migration applies in-place without `drop_table`; pending re-embed forces rebuild; version stamping on create/drop; bootstrap stamping of a pre-machinery 0-version table to 1; stop-at-boundary with a mixed [structural v2, reembed v3, structural v4] registry asserting v4 is NOT applied and the stored version stays at 2; `rebuild_table()` round-trips rows byte-for-byte (shared with compact tests).
+
+### Open questions
+
+- PR #12968 disposition: close with a comment pointing at this spec once the machinery lands (the Lance-specific `add_columns` path has no successor; vec0 cannot do in-place column adds).
+- `created`/`added` fields are also candidates for future structural metadata work, but nothing needs them now (YAGNI; noted only so the next reader does not re-derive it).
@@ -0,0 +1,155 @@
+# sqlite-vec Vector Store Design (replaces PaperlessLanceVectorStore)
+
+Date: 2026-06-10
+
+Context: LanceDB wheels SIGILL on non-AVX2 CPUs (#12970); research in `2026-06-10-vector-store-alternatives-research.md` selected sqlite-vec. This is a beta feature, so a one-time re-embed on upgrade is acceptable. Every claim marked [VERIFIED] below was empirically tested against the actual PyPI wheel (0.1.9, and 0.1.10a4 where noted), either in this repo's scratch harness (`/tmp/vstore-avx-test/explore_sqlitevec*.py`) or by the issues-audit agent.
+
+## Version pin: `sqlite-vec==0.1.9`, and why it is load-bearing
+
+- The 0.1.9 linux x86_64 wheel is built with **no SIMD flags at all** (`vec_debug()` shows empty build flags) and passed our qemu Westmere (SSE4.2, no AVX) and SandyBridge (AVX, no AVX2) emulation tests [VERIFIED]. This is the entire point of the migration.
+- The **0.1.10-alpha.4 wheel regresses this**: built with `-mavx -DSQLITE_VEC_ENABLE_AVX` file-wide, no runtime CPU dispatch. It can SIGILL on AVX-less CPUs, including Goldmont Atom/Celeron NAS boxes, exactly the #12970 user base [VERIFIED via vec_debug on the wheel].
+- Guardrails: pin `==0.1.9` exactly; log `SELECT vec_version(), vec_debug()` at store init as an AVX canary; before ever bumping to 0.1.10+, re-check the wheel flags (and consider raising the runtime-dispatch issue upstream first).
+- arm64: the 0.1.9 manylinux aarch64 wheel is a proper ELF64 binary with no NEON flags baked in [VERIFIED]. (The broken 32-bit "aarch64" wheel era was 0.1.6, fixed since.)
+- No sdist on PyPI (asg017/sqlite-vec#211, open) and no musl wheels; this is fine for our Debian-based image but blocks Alpine bare-metal installs.
+
+## Schema
+
+One dedicated SQLite database file in `LLM_INDEX_DIR` (e.g. `llmindex.db`), never the Django DB. Connections set `PRAGMA journal_mode=WAL`, `busy_timeout`, `synchronous=NORMAL`.
+
+```sql
+CREATE VIRTUAL TABLE nodes USING vec0(
+    id TEXT PRIMARY KEY,             -- node_id (uuid)
+    document_id TEXT,                -- METADATA column, deliberately NOT a partition key
+    modified TEXT,                   -- ISO timestamp; never NULL (sentinel "")
+    +node_content TEXT,              -- auxiliary column: JSON payload, any size
+    embedding float[{dim}] distance_metric=cosine
+);
+
+CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT);
+-- rows: embed_model, dim, schema_version, created_by_vec_version
+```
+
+Design decisions, each verified on 0.1.9:
+
+- **`document_id` is a metadata column, not a partition key.** With a partition key, `k` applies per partition: `k=5 AND document_id IN (3 docs)` returns 15 rows (asg017/sqlite-vec#142, open) [VERIFIED]. As a metadata column the same query returns a correct global top-k of exactly 5 [VERIFIED]. `query_similar_documents()` passes permission-scoped `IN` lists, so per-partition semantics would over-fetch k x N(docs). At our scale the partition-pruning speedup is not needed (filtered KNN at 20K x 1024 was _faster_ than unfiltered: 39 ms vs 74 ms).
+- **One document column, not two.** The Lance store carried both `doc_id` (ref_doc_id) and `document_id`; in our usage they are always the same value (`str(document.id)`), so the new schema keeps only `document_id`.
+- **TEXT primary key works** (insert, UPDATE, DELETE, duplicate rejection) [VERIFIED]. There is no usable rowid mapping with a TEXT pk, which we do not need.
+- **Aux column for the payload.** `+node_content` holds the multi-KB JSON. Aux columns cannot appear in KNN WHERE clauses (loud error, not silent) [VERIFIED], and we never filter on the payload; they are selectable in scans and KNN results [VERIFIED].
+- **Metadata columns reject NULL** (asg017/sqlite-vec#141, open) [VERIFIED]. `_row()` must keep coercing everything through `str(... or "")` as it already does today.
+- **`distance_metric=cosine`**: similarity maps as `1 - distance` (identical vector gives distance 0.0 [VERIFIED]). For unit-norm embeddings the ranking equals today's L2 ranking; for non-normalized models cosine is the safer default, and the beta re-embed makes the behavior change free. (L2 + `1/(1+d)` remains available if exact parity is ever wanted.)
+- **Vectors are always bound as float32 BLOBs** (`struct.pack`/`np.tobytes`), never JSON text: bypasses the locale-dependent `strtod` parsing bug (asg017/sqlite-vec#241, open) entirely.
+- Limits, all comfortable: dims <= 8192, k <= 4096, chunk_size default 1024 [VERIFIED]. TEXT metadata has no length cap; values > 12 bytes go to a shadow text table with a prefix fast-path, and the one historical bug at that boundary (long-metadata DELETE, #274) is fixed in 0.1.9.
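+
+The BLOB binding and the distance-to-similarity mapping from the list above can be sketched as follows. This is a minimal illustration, not the store's code: `pack_vector`, `unpack_vector`, and `similarity_from_cosine_distance` are hypothetical helper names, and the native-byte-order float32 layout is an assumption about what vec0 expects for a `float[N]` column.
+
+```python
+import struct
+
+
+def pack_vector(values: list[float]) -> bytes:
+    """Serialize a vector as packed float32, 4 bytes per dimension (hypothetical helper)."""
+    return struct.pack(f"{len(values)}f", *values)
+
+
+def unpack_vector(blob: bytes) -> list[float]:
+    """Inverse of pack_vector; the dimension is inferred from the BLOB length."""
+    return list(struct.unpack(f"{len(blob) // 4}f", blob))
+
+
+def similarity_from_cosine_distance(distance: float) -> float:
+    """Map a cosine distance to a similarity score, as described above."""
+    return 1.0 - distance
+
+
+blob = pack_vector([0.0, 1.0, 0.5])
+```
+
+Binding `blob` as a `?` parameter keeps the vector out of any text parsing path, which is the reason given above for avoiding JSON.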
+
+## Method mapping (PaperlessLanceVectorStore -> PaperlessSqliteVecVectorStore)
+
+| Current method                                | sqlite-vec implementation                                                                                                              | Notes                                                                                                                                                                                                                   |
+| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `__init__(uri, table_name, embed_model_name)` | `sqlite3.connect(path)` + `enable_load_extension` + `sqlite_vec.load()` + PRAGMAs                                                      | Same lazy "table may not exist yet" stance                                                                                                                                                                              |
+| `client` property                             | the `sqlite3.Connection`                                                                                                               |                                                                                                                                                                                                                         |
+| `table_exists()`                              | `SELECT 1 FROM sqlite_master WHERE name='nodes'`                                                                                       |                                                                                                                                                                                                                         |
+| `vector_dim()`                                | `index_meta['dim']`                                                                                                                    | Written at table creation; wrong-dim inserts are rejected by vec0 anyway [VERIFIED]                                                                                                                                     |
+| `drop_table()`                                | `DROP TABLE nodes`                                                                                                                     | Drops all 7 shadow tables with it [VERIFIED]; also clear `index_meta`                                                                                                                                                   |
+| `stored_model_name()` / `config_mismatch()`   | `index_meta['embed_model']`                                                                                                            | Same conservative None handling                                                                                                                                                                                         |
+| `_schema(dim, model)`                         | the CREATE statements above                                                                                                            | dim from first batch, as today (`_ensure_table`)                                                                                                                                                                        |
+| `_row(node)`                                  | same dict, vector packed to bytes                                                                                                      | keep `str(... or "")` coercion (NULL rejection)                                                                                                                                                                         |
+| `add(nodes)`                                  | `executemany(INSERT ...)` inside one transaction                                                                                       | ~3,300 rows/s at 1024 dims measured; batching via transactions                                                                                                                                                          |
+| `upsert_document(document_id, nodes)`         | `BEGIN; DELETE FROM nodes WHERE document_id = ?; executemany(INSERT); COMMIT`                                                          | **Not** `INSERT OR REPLACE`: broken on vec0 (asg017/sqlite-vec#259, open). Transaction gives the same no-transient-empty-state guarantee as merge_insert; rollback verified [VERIFIED]                                  |
+| `delete(ref_doc_id)`                          | `DELETE FROM nodes WHERE document_id = ?`                                                                                              |                                                                                                                                                                                                                         |
+| `get_nodes(filters)`                          | `SELECT id, document_id, node_content, embedding FROM nodes [WHERE ...]`                                                               | full scans on vec0 work [VERIFIED]; 45 ms / 20K rows                                                                                                                                                                    |
+| `query(VectorStoreQuery)`                     | `SELECT id, node_content, embedding, distance FROM nodes WHERE embedding MATCH ? AND k = ? [AND filters]` then Python-slice to `top_k` | `k = ?` is mandatory; `LIMIT` cannot be combined with `k` [VERIFIED]; results arrive distance-sorted [VERIFIED]; similarities = `1 - distance`                                                                          |
+| `_build_where(filters)`                       | same EQ/IN translation, but emitting `?` placeholders + params list                                                                    | **Upgrade**: bound parameters replace today's manual `_escape()` string interpolation                                                                                                                                   |
+| `get_modified_times()`                        | `SELECT document_id, modified FROM nodes` + first-seen dedupe in Python                                                                | identical logic                                                                                                                                                                                                         |
+| `ensure_document_id_scalar_index()`           | no-op (delete if nothing else needs it)                                                                                                | metadata filters are evaluated in the chunk scan; nothing to create                                                                                                                                                     |
+| `maybe_create_ann_index()`                    | no-op on 0.1.9                                                                                                                         | ANN (rescore/diskann) is 0.1.10-alpha territory; adopting an ANN index makes the file unreadable by 0.1.9 (one-way door), while flat tables round-trip 0.1.9 <-> 0.1.10a4 cleanly [VERIFIED]. Revisit post-0.1.10-final |
+| `compact(retention_seconds)`                  | **rebuild-based compaction**, see below                                                                                                | replaces Lance MVCC cleanup                                                                                                                                                                                             |
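+
+The `upsert_document` row can be sketched with the standard-library `sqlite3` module. A plain table stands in for the vec0 virtual table so the sketch runs without the extension; this is an assumption for illustration only, and the function body is a guess at the shape, not the planned implementation.
+
+```python
+import sqlite3
+
+
+def upsert_document(conn: sqlite3.Connection, document_id: str, rows) -> None:
+    """Replace every row of one document atomically.
+
+    rows: iterable of (id, modified, node_content, embedding_blob) tuples.
+    """
+    # "with conn" commits on success and rolls back if any statement raises,
+    # so other connections never observe the document with zero rows.
+    with conn:
+        conn.execute("DELETE FROM nodes WHERE document_id = ?", (document_id,))
+        conn.executemany(
+            "INSERT INTO nodes (id, document_id, modified, node_content, embedding) "
+            "VALUES (?, ?, ?, ?, ?)",
+            [(rid, document_id, mod, content, blob) for rid, mod, content, blob in rows],
+        )
+
+
+def node_ids(conn: sqlite3.Connection) -> list[str]:
+    return [r[0] for r in conn.execute("SELECT id FROM nodes ORDER BY id").fetchall()]
+
+
+conn = sqlite3.connect(":memory:")
+# Stand-in for CREATE VIRTUAL TABLE nodes USING vec0(...).
+conn.execute(
+    "CREATE TABLE nodes (id TEXT PRIMARY KEY, document_id TEXT, "
+    "modified TEXT, node_content TEXT, embedding BLOB)"
+)
+vec = b"\x00" * 4
+upsert_document(conn, "7", [("a", "", "{}", vec), ("b", "", "{}", vec)])
+upsert_document(conn, "7", [("c", "", "{}", vec)])
+ids_after_replace = node_ids(conn)
+
+# A failing insert (duplicate id) must also undo the DELETE.
+try:
+    upsert_document(conn, "7", [("d", "", "{}", vec), ("d", "", "{}", vec)])
+except sqlite3.IntegrityError:
+    pass
+ids_after_failed_upsert = node_ids(conn)
+```
+
+The failed third call leaves the table exactly as the second call left it, which is the rollback guarantee the table row relies on.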
+
+Filter constraint surface (loud errors otherwise, [VERIFIED]): only `=, !=, <, <=, >, >=, IN` on metadata columns in KNN queries. We use only EQ/IN. Never use `NOT IN` (the vtab cannot see it; SQLite post-filters and silently under-delivers below k, asg017/sqlite-vec#116).
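+
+A sketch of the placeholder-emitting `_build_where` under these constraints. The `(column, op, value)` tuple input is a simplification chosen for this example (the real method receives the store's filter objects), and `build_where` and `ALLOWED_COLUMNS` are hypothetical names.
+
+```python
+ALLOWED_COLUMNS = {"document_id", "modified"}  # metadata columns from the schema
+
+
+def build_where(filters):
+    """Translate EQ/IN filters into a WHERE fragment plus bound parameters."""
+    clauses = []
+    params = []
+    for column, op, value in filters:
+        # Column names cannot be bound, so they are checked against an allowlist.
+        if column not in ALLOWED_COLUMNS:
+            raise ValueError(f"cannot filter on column: {column}")
+        if op == "==":
+            clauses.append(f"{column} = ?")
+            params.append(str(value))
+        elif op == "in":
+            values = [str(v) for v in value]
+            if not values:
+                raise ValueError("empty IN list; the caller should short-circuit")
+            placeholders = ", ".join("?" for _ in values)
+            clauses.append(f"{column} IN ({placeholders})")
+            params.extend(values)
+        else:
+            # Anything else (including NOT IN) is refused rather than emitted.
+            raise ValueError(f"unsupported operator: {op}")
+    return " AND ".join(clauses), params
+
+
+where, params = build_where([("document_id", "in", [3, 4]), ("modified", "==", "x")])
+```
+
+Values only ever travel through `params`, so no escaping helper is needed.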
+
+## Compaction: the one real behavioral difference
+
+vec0 DELETE only flips a validity bit; space is never reclaimed, and VACUUM recovers only about half (asg017/sqlite-vec#54, #220, open; fix PRs #243/#210 unmerged). Measured: 5 delete+reinsert cycles on 2K rows grew the file 3.32 MB -> 6.56 MB; VACUUM got back to 4.94 MB. Paperless's per-document churn (every document edit is a delete+reinsert) hits this directly.
+
+So `compact()` becomes the maintainer-endorsed rebuild (asg017/sqlite-vec#205):
+
+```sql
+CREATE VIRTUAL TABLE nodes_new USING vec0(...);
+INSERT INTO nodes_new SELECT id, document_id, modified, node_content, embedding FROM nodes;
+DROP TABLE nodes;
+ALTER TABLE nodes_new RENAME TO nodes;  -- then VACUUM
+```
+
+This copies vectors without re-embedding, runs under the existing write FileLock, and slots into the existing `document_llmindex compact` command and the scheduled maintenance task. A cheap trigger heuristic: rebuild when the cumulative row count of the `nodes_rowids` shadow table exceeds ~2x the live row count, or just keep the existing scheduled cadence.
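+
+The rebuild sequence can be driven from Python roughly as below. As a labeled assumption, a plain table replaces the vec0 virtual table so the sketch runs without the extension; `rebuild_table` here only mirrors the four SQL statements above and is not the planned method body.
+
+```python
+import sqlite3
+
+COLUMNS = "id, document_id, modified, node_content, embedding"
+
+# Stand-in DDL; the real statement would be CREATE VIRTUAL TABLE ... USING vec0(...).
+PLAIN_DDL = (
+    "CREATE TABLE {name} (id TEXT PRIMARY KEY, document_id TEXT, "
+    "modified TEXT, node_content TEXT, embedding BLOB)"
+)
+
+
+def rebuild_table(conn: sqlite3.Connection, create_new_sql: str) -> None:
+    """Copy live rows into a fresh table, swap it in, then reclaim file space."""
+    conn.execute(create_new_sql)  # must create a table named nodes_new
+    conn.execute(f"INSERT INTO nodes_new SELECT {COLUMNS} FROM nodes")
+    conn.execute("DROP TABLE nodes")
+    conn.execute("ALTER TABLE nodes_new RENAME TO nodes")
+    conn.commit()
+    conn.execute("VACUUM")  # VACUUM cannot run inside a transaction
+
+
+conn = sqlite3.connect(":memory:")
+conn.execute(PLAIN_DDL.format(name="nodes"))
+conn.executemany(
+    "INSERT INTO nodes VALUES (?, ?, ?, ?, ?)",
+    [
+        ("a", "1", "", '{"text": "x"}', b"\x00\x00\x80\x3f"),
+        ("b", "2", "", '{"text": "y"}', b"\x00\x00\x00\x40"),
+    ],
+)
+conn.commit()
+
+before = conn.execute(f"SELECT {COLUMNS} FROM nodes ORDER BY id").fetchall()
+rebuild_table(conn, PLAIN_DDL.format(name="nodes_new"))
+after = conn.execute(f"SELECT {COLUMNS} FROM nodes ORDER BY id").fetchall()
+```
+
+Comparing `before` and `after` is the same byte-for-byte check the test plan asks for.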
+
+## Concurrency
+
+vec0 is a plain vtab over ordinary shadow tables, so standard SQLite WAL semantics apply, and the existing architecture is already the textbook arrangement: writers serialized by `settings.LLM_INDEX_LOCK` FileLock, readers concurrent via WAL. Verified across processes: a reader during another process's open write transaction does not block and sees a consistent pre-transaction snapshot; post-commit it sees the new rows [VERIFIED]. No sqlite-vec-specific multi-process corruption, locking, or segfault reports exist in the tracker. The 0.1.10a4 cached-statement fix (#295) is a Firefox/mozStorage `sqlite3_close()` issue; CPython's `sqlite3` is unaffected, no Python-side reports.
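+
+The reader-during-write behavior is ordinary WAL semantics and can be reproduced with the standard library alone. In this sketch a plain table stands in for vec0 and both connections live in one process, so it illustrates the snapshot rule rather than repeating the cross-process verification; `open_conn` and `count_rows` are hypothetical helpers, and the 5000 ms timeout is an arbitrary example value.
+
+```python
+import os
+import sqlite3
+import tempfile
+
+
+def open_conn(path: str) -> sqlite3.Connection:
+    conn = sqlite3.connect(path)
+    conn.execute("PRAGMA journal_mode=WAL")
+    conn.execute("PRAGMA busy_timeout=5000")
+    conn.execute("PRAGMA synchronous=NORMAL")
+    return conn
+
+
+def count_rows(conn: sqlite3.Connection) -> int:
+    return conn.execute("SELECT count(*) FROM nodes").fetchall()[0][0]
+
+
+with tempfile.TemporaryDirectory() as tmp:
+    path = os.path.join(tmp, "llmindex.db")
+    writer = open_conn(path)
+    writer.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY)")
+    writer.commit()
+    reader = open_conn(path)
+
+    # The INSERT opens a write transaction that stays open until commit().
+    writer.execute("INSERT INTO nodes VALUES ('a')")
+    seen_during_write = count_rows(reader)  # does not block; pre-transaction view
+
+    writer.commit()
+    seen_after_commit = count_rows(reader)  # a new read sees the committed row
+
+    reader.close()
+    writer.close()
+```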
+
+Same caveat as the main SQLite DB: `LLM_INDEX_DIR` should not be on NFS.
+
+## Performance expectations (measured on the 0.1.9 no-SIMD wheel)
+
+- KNN 20K rows x 1024 dims: ~74 ms plain, ~39 ms with a metadata EQ filter.
+- 100K x 768: 185 ms/query (vs 497 ms for LanceDB exact search on identical data).
+- Extrapolated 500K x 1024-1536: ~0.9-1.8 s/query; 384 dims roughly 4x faster. Acceptable for suggestions/chat at the extreme tail; typical installs (low tens of thousands of chunks) are tens of ms.
+- Insert: ~3,300 rows/s at 1024 dims in a single transaction.
+- File size: ~raw vector size (~4.3 KB/row at 1024 dims), no compression; plus the bloat behavior above.
+
+## Migration from the Lance store
+
+Beta policy: re-embed. On startup/first index task: if `LLM_INDEX_DIR` contains a Lance table but no `llmindex.db`, log and queue a full rebuild, then remove the Lance directory. No cross-store vector copy, no lancedb import anywhere in the path (which is what un-breaks #12970 hosts: they currently crash at import, have no usable index, and get a fresh build).
+
+PR #12968's migration machinery maps onto `index_meta['schema_version']`: structural migrations = create-new-table + `INSERT ... SELECT` + rename (vectors copied, no re-embed; same shape as the compaction rebuild); re-embed migrations = drop + full rebuild, jumping straight to the current version.
+
+## Dependency changes
+
+- Add: `sqlite-vec==0.1.9` (one ~100 KB platform wheel, zero Python deps).
+- Remove: `lancedb~=0.33.0` (and its pylance/lancedb wheels, ~40 MB). `pyarrow` leaves this module; check whether anything else in the AI stack still needs it before dropping from pyproject.
+
+## Test plan notes
+
+- pytest-style per project convention; the store tests can run against a tmp_path DB file (or `:memory:` for pure-logic tests; extension loading works on uv-managed CPython [VERIFIED]).
+- Port the existing `test_vector_store.py` surface; add dedicated tests for: upsert transactionality (no transient empty state mid-upsert from a second connection), NULL-coercion in `_row()`, k-slice behavior, EQ/IN filter correctness, compaction rebuild preserving rows byte-for-byte, vec_debug canary logging.
+- The qemu matrix (`/tmp/vstore-avx-test/`) can be re-run against any future sqlite-vec bump: `qemu-x86_64 -cpu Westmere venv/bin/python candidate_test.py sqlite_vec <dir>`.
+
+## Benchmark harness
+
+`src/bench_vector_store.py` -- standalone head-to-head comparison run during the migration window when both `PaperlessLanceVectorStore` and `PaperlessSqliteVecVectorStore` coexist (Task 3 Phase A of the implementation plan). After Phase B replaces `vector_store.py`, the Lance import fails gracefully and only the sqlite-vec half runs (useful for post-migration baseline checks).
+
+```bash
+cd src
+uv run python bench_vector_store.py            # auto-generates bench_data.pkl on first run
+uv run python bench_vector_store.py --regenerate  # force re-embed
+```
+
+**Phase 1 (data generation, skipped if `bench_data.pkl` exists):** Faker generates `--n-docs` (default 2000) fake documents -- title, body, correspondent, ISO timestamp. Each body is split into `--chunks-per-doc` (default 3) equal-length chunks (~6000 total nodes). A warm-up embed call fires before generation to ensure the model is resident on the GPU. All chunk texts are embedded via Ollama `/api/embed` in batches of 32 and saved to `bench_data.pkl`. The Faker seed is fixed at 42 for reproducibility.
+
+**Phase 2 (benchmark):** Each store runs in an isolated `tempfile.TemporaryDirectory()`. Query vectors are drawn reproducibly from the corpus (every 10th node, wrapping).
+
+| Operation                                 | Reps | Metric                |
+| ----------------------------------------- | ---- | --------------------- |
+| `add()` bulk insert                       | 1    | total time            |
+| `query()` plain                           | 50   | p50 / p95             |
+| `query()` filtered (IN on 20% of doc IDs) | 50   | p50 / p95             |
+| `get_modified_times()`                    | 20   | p50                   |
+| `upsert_document()`                       | 50   | p50 / p95             |
+| `compact()`                               | 1    | total time            |
+| File size                                 | --   | pre- and post-compact |
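+
+For orientation, the timing and percentile bookkeeping behind this table might look like the sketch below. The helper names are hypothetical, and the nearest-rank percentile and the exact index formula for "every 10th node, wrapping" are assumptions; the harness may compute either differently.
+
+```python
+import math
+import time
+
+
+def time_calls(fn, reps: int) -> list[float]:
+    """Run fn() reps times and return per-call latencies in milliseconds."""
+    samples = []
+    for _ in range(reps):
+        start = time.perf_counter()
+        fn()
+        samples.append((time.perf_counter() - start) * 1000.0)
+    return samples
+
+
+def percentile(samples: list[float], pct: int) -> float:
+    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
+    ordered = sorted(samples)
+    rank = max(1, math.ceil(pct * len(ordered) / 100))
+    return ordered[rank - 1]
+
+
+def pick_query_indices(n_nodes: int, n_queries: int, stride: int = 10) -> list[int]:
+    """Every stride-th node, wrapping around the corpus when it runs out."""
+    return [(i * stride) % n_nodes for i in range(n_queries)]
+
+
+latencies = time_calls(lambda: sum(range(1000)), reps=50)
+p50, p95 = percentile(latencies, 50), percentile(latencies, 95)
+```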
+
+**CLI flags:** `--n-docs` (2000), `--chunks-per-doc` (3), `--data-file` (`bench_data.pkl`), `--regenerate`, `--ollama-url` (`http://192.168.1.87:11434`), `--embed-model` (`qwen3-embedding:4b`), `--query-iters` (50).
+
+**Dependencies:** `faker` and `httpx` must be available (`uv add --dev faker httpx` if not already installed).
+
+## Risk register (from the 2026-06-10 issues audit)
+
+| Risk                                        | Ref                                     | State          | Disposition                                                                                                                                       |
+| ------------------------------------------- | --------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
+| 0.1.10+ wheels bake AVX, no dispatch        | release CI change, verified on 0.1.10a4 | current        | Pin 0.1.9; vec_debug canary; upstream ask before any bump                                                                                         |
+| DELETE never reclaims space; VACUUM ~50%    | #54, #220                               | open           | Rebuild-based `compact()` above                                                                                                                   |
+| INSERT OR REPLACE broken on vec0            | #259                                    | open           | Use DELETE+INSERT in txn (design already does)                                                                                                    |
+| NULL metadata rejected                      | #141                                    | open           | Sentinel `""` coercion (already current behavior)                                                                                                 |
+| Partition-key IN returns k per partition    | #142                                    | open           | Avoided: document_id is a metadata column                                                                                                         |
+| NOT IN silently under-delivers              | #116                                    | open           | Never emit NOT IN                                                                                                                                 |
+| Locale strtod breaks JSON vector parsing    | #241                                    | open           | Always BLOB-bind vectors                                                                                                                          |
+| Single weekend maintainer; fix PRs languish | #226                                    | open           | Mitigated by Mozilla sponsorship + Firefox vendoring (release-train consumer); pin + vendor-from-source remains the escape hatch (no sdist: #211) |
+| ANN index = one-way file format             | 0.1.10 alphas                           | --             | Do not adopt ANN until 0.1.10 final + flag audit                                                                                                  |
+| Long-TEXT metadata DELETE bug               | #274                                    | fixed in 0.1.9 | Floor requirement `>=0.1.9` already implied by pin                                                                                                |