diff --git a/docs/superpowers/plans/2026-06-02-lancedb-vector-store.md b/docs/superpowers/plans/2026-06-02-lancedb-vector-store.md deleted file mode 100644 index 94503abdd..000000000 --- a/docs/superpowers/plans/2026-06-02-lancedb-vector-store.md +++ /dev/null @@ -1,1721 +0,0 @@ -# LanceDB Vector Store Implementation Plan - -> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. - -**Goal:** Replace the AI feature's FAISS + `SimpleDocumentStore` + `SimpleIndexStore` llama-index storage with a single LanceDB table fronted by a custom `BasePydanticVectorStore` adapter, eliminating fake deletes, whole-file rewrites, the custom chat retriever, and pandas. - -**Architecture:** A new `paperless_ai/vector_store.py` defines `PaperlessLanceVectorStore`, a llama-index `BasePydanticVectorStore` talking to `lancedb` + `pyarrow` directly. `indexing.py` is rewired to build the index from that store alone (`VectorStoreIndex.from_vector_store`), add/update via atomic `merge_insert` upsert, remove via predicate delete, and query/similar/chat via stock retrievers with `MetadataFilters`. Disk bloat from MVCC is reclaimed with `optimize(cleanup_older_than=...)` folded into the scheduled `update_llm_index`. - -**Tech Stack:** Python 3.11+, Django, llama-index-core, lancedb 0.33.x, pyarrow, pytest + pytest-django + pytest-mock, factory-boy, uv. - -**Reference spec:** `docs/superpowers/specs/2026-06-02-lancedb-vector-store-design.md` - ---- - -## Conventions for this plan - -- Backend tests are **pytest-style, grouped in classes**, with `@pytest.mark.django_db` on the class when DB access is needed. Annotate fixture params, fixture return types, and test signatures. Use the `mocker` fixture (pytest-mock), not bare `patch`. Build models with `DocumentFactory` from `documents/tests/factories.py`. (See `CLAUDE.md`.) 
-- Run a single test: `uv run pytest src/paperless_ai/tests/test_vector_store.py::TestClass::test_x -v` -- Lint/format with the **global** `ruff`: `ruff check src/paperless_ai` and `ruff format src/paperless_ai` (not `uv run ruff`). -- **Tests cannot be executed in the authoring session on this machine** — where a step says "Run … Expected: PASS", the implementer runs it and confirms before moving on. -- Commit messages end with the trailer: - `Co-Authored-By: Claude Opus 4.8 (1M context) ` - ---- - -## HARD CONSTRAINT: lazy imports of AI libraries - -**`llama_index`, `lancedb`, and `pyarrow` must never be imported at module-load time of any module reachable from a non-AI entry point.** A simple management command must not transitively drag in gigabytes of AI libraries. This is a known past regression and a hard requirement. - -The existing code already enforces this pattern, and this plan must preserve it: - -- `documents/tasks.py` imports `paperless_ai.indexing` **at module top**, and management commands import `documents.tasks` / `documents.models`. So anything `indexing.py` pulls at module-load time lands in the light path. -- Today `indexing.py` and `embedding.py` keep all `llama_index` / `faiss` imports **function-local** (e.g. `build_document_node` imports `LlamaDocument` inside the function). Importing `indexing.py` does **not** import `llama_index`. - -Rules for this plan: - -1. `paperless_ai/vector_store.py` (the new adapter) **may** import `lancedb` / `pyarrow` / `llama_index` at its top level — it is pure AI code. -2. **`indexing.py` must import `vector_store` only inside functions** (e.g. inside `get_vector_store()`), never at module top. Use `if TYPE_CHECKING:` for type hints. -3. **Any `llama_index` symbol used in `indexing.py` / `chat.py` (including `MetadataMode`, retrievers, filters) must be imported inside the function that uses it**, never at module top. -4. 
Test modules under `paperless_ai/tests/` **may** import these at the top — they are AI tests. -5. A subprocess guard test (Task 6) asserts that importing `documents.tasks` leaves `lancedb` / `pyarrow` / `llama_index` absent from `sys.modules`. - ---- - -## File Structure - -- **Create** `src/paperless_ai/vector_store.py` — `PaperlessLanceVectorStore` adapter (schema, add, upsert_document, delete, get_nodes, query, `_build_where`, `maybe_create_ann_index`, `optimize`). Single responsibility: the LanceDB ↔ llama-index storage boundary. -- **Create** `src/paperless_ai/tests/test_vector_store.py` — adapter unit/integration tests. -- **Modify** `src/paperless_ai/indexing.py` — factory + load/build/add/update/remove/similar functions rewired to the adapter; delete `get_or_create_storage_context`, `remove_document_docstore_nodes`; change `build_document_node`, `vector_store_file_exists`, `update_llm_index`, `llm_index_add_or_update_document`, `llm_index_remove_document`, `query_similar_documents`, `load_or_build_index`. -- **Modify** `src/paperless_ai/chat.py` — delete `_get_document_filtered_retriever`; use stock `VectorIndexRetriever` with filters; switch the no-content pre-check to `store.get_nodes`. -- **Modify** `src/documents/tasks.py` — call adapter compaction at the end of `update_llm_index` (via indexing helper) — no new beat task. -- **Modify** `pyproject.toml` — drop `faiss-cpu`, `llama-index-vector-stores-faiss`; add `lancedb`, `pyarrow`. -- **Modify** `src/paperless_ai/embedding.py` — add a `current_embedding_dim()` helper used by the dimension-mismatch guard (logic already mostly present in `get_embedding_dim`). -- **Modify** `src/paperless_ai/tests/test_ai_indexing.py`, `src/paperless_ai/tests/test_chat.py` — update tests that referenced FAISS/docstore internals. 
- ---- - -## Task 1: Swap dependencies - -**Files:** - -- Modify: `pyproject.toml:45` (remove `faiss-cpu`), `pyproject.toml:60` (remove `llama-index-vector-stores-faiss`), and add `lancedb` + `pyarrow` in alphabetical position. - -- [ ] **Step 1: Remove the FAISS dependencies** - -In `pyproject.toml`, delete these two lines from the `dependencies` array: - -```toml - "faiss-cpu>=1.10", -``` - -```toml - "llama-index-vector-stores-faiss>=0.5.2", -``` - -- [ ] **Step 2: Add lancedb and pyarrow** - -In the same `dependencies` array, add (keep the array alphabetized — `lancedb` goes just before `langdetect`, `pyarrow` just before `python-dateutil`): - -```toml - "lancedb~=0.33.0", -``` - -```toml - "pyarrow>=16", -``` - -- [ ] **Step 3: Resolve the lockfile** - -Run: `uv sync --group dev` -Expected: resolves and installs; `faiss-cpu` and `llama-index-vector-stores-faiss` are removed, `lancedb`/`pyarrow` added. No `pandas` is added. - -- [ ] **Step 4: Verify pandas is absent and lancedb imports** - -Run: `uv run python -c "import importlib.util as u; import lancedb, pyarrow; print('lancedb', lancedb.__version__); print('pandas present:', u.find_spec('pandas') is not None)"` -Expected: prints the lancedb version and `pandas present: False`. - -- [ ] **Step 5: Verify multi-arch wheels resolved** - -Run: `uv pip show lancedb pyarrow` -Expected: both shown with versions. (Linux x86_64 + aarch64 wheels exist for lancedb 0.33.x — confirm CI Docker build later.) 
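The import check from Step 4 can also be written as a small reusable helper. This is a minimal sketch and not part of the plan itself: the `installed` function name is hypothetical, and `faiss` is assumed to be the import name provided by the `faiss-cpu` distribution.

```python
import importlib.util


def installed(names: list[str]) -> dict[str, bool]:
    """Report which top-level packages are importable, without importing them."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    # After Task 1 the plan expects lancedb/pyarrow present and pandas/faiss absent.
    for name, present in installed(["lancedb", "pyarrow", "pandas", "faiss"]).items():
        print(f"{name}: {'present' if present else 'absent'}")
```

Because `find_spec` only locates a top-level package and does not execute it, the check stays cheap even when the package is a large AI library.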
- -- [ ] **Step 6: Commit** - -```bash -git add pyproject.toml uv.lock -git commit -m "build: replace faiss-cpu with lancedb for the AI vector store - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 2: The adapter — schema, add, delete, get_nodes, query - -**Files:** - -- Create: `src/paperless_ai/vector_store.py` -- Test: `src/paperless_ai/tests/test_vector_store.py` - -- [ ] **Step 1: Write the failing test for add → query round-trip** - -Create `src/paperless_ai/tests/test_vector_store.py`: - -```python -from pathlib import Path - -import pytest -from llama_index.core.schema import TextNode -from llama_index.core.vector_stores.types import FilterOperator -from llama_index.core.vector_stores.types import MetadataFilter -from llama_index.core.vector_stores.types import MetadataFilters -from llama_index.core.vector_stores.types import VectorStoreQuery - -from paperless_ai.vector_store import PaperlessLanceVectorStore - -DIM = 8 - - -def _node(node_id: str, document_id: str, text: str, vec: float) -> TextNode: - node = TextNode(id_=node_id, text=text, metadata={"document_id": document_id}) - node.set_content(text) - node.embedding = [vec] * DIM - node.relationships = {} - node.ref_doc_id = document_id - return node - - -class TestPaperlessLanceVectorStoreCrud: - @pytest.fixture - def store(self, tmp_path: Path) -> PaperlessLanceVectorStore: - return PaperlessLanceVectorStore(uri=str(tmp_path / "idx")) - - def test_add_then_query_returns_node( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add([_node("1-0", "1", "alpha", 0.1), _node("2-0", "2", "beta", 0.9)]) - - result = store.query( - VectorStoreQuery(query_embedding=[0.1] * DIM, similarity_top_k=1), - ) - - assert len(result.nodes) == 1 - assert result.nodes[0].metadata["document_id"] == "1" - - def test_query_empty_table_returns_empty_no_raise( - self, - store: PaperlessLanceVectorStore, - ) -> None: - result = store.query( - VectorStoreQuery(query_embedding=[0.1] * DIM, 
similarity_top_k=5), - ) - assert result.nodes == [] - assert result.ids == [] -``` - -- [ ] **Step 2: Run the test to verify it fails** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py -v` -Expected: FAIL with `ModuleNotFoundError: No module named 'paperless_ai.vector_store'`. - -- [ ] **Step 3: Write the adapter (add/delete/get_nodes/query/client)** - -Create `src/paperless_ai/vector_store.py`: - -```python -import json -import logging -from typing import Any - -import lancedb -import pyarrow as pa -from llama_index.core.bridge.pydantic import PrivateAttr -from llama_index.core.schema import BaseNode -from llama_index.core.vector_stores.types import BasePydanticVectorStore -from llama_index.core.vector_stores.types import FilterCondition -from llama_index.core.vector_stores.types import FilterOperator -from llama_index.core.vector_stores.types import MetadataFilters -from llama_index.core.vector_stores.types import VectorStoreQuery -from llama_index.core.vector_stores.types import VectorStoreQueryResult -from llama_index.core.vector_stores.utils import metadata_dict_to_node -from llama_index.core.vector_stores.utils import node_to_metadata_dict - -logger = logging.getLogger("paperless_ai.vector_store") - -DEFAULT_TABLE_NAME = "documents" - - -def _escape(value: str) -> str: - return str(value).replace("'", "''") - - -def _build_where(filters: MetadataFilters | None) -> str | None: - """Translate the EQ / IN filters we use into a Lance SQL predicate on the - top-level ``document_id`` column.""" - if filters is None or not filters.filters: - return None - clauses: list[str] = [] - for f in filters.filters: - if f.operator == FilterOperator.IN: - vals = ",".join(f"'{_escape(v)}'" for v in f.value) - clauses.append(f"{f.key} IN ({vals})") - elif f.operator == FilterOperator.EQ: - clauses.append(f"{f.key} = '{_escape(f.value)}'") - else: # pragma: no cover - we only ever build EQ/IN filters - raise NotImplementedError(f"Unsupported filter operator: 
{f.operator}") - joiner = " OR " if filters.condition == FilterCondition.OR else " AND " - return joiner.join(clauses) - - -class PaperlessLanceVectorStore(BasePydanticVectorStore): - """A llama-index vector store backed directly by a LanceDB table. - - Stores one row per node with the node id, its document id (both as the - ``ref_doc_id`` delete key ``doc_id`` and a top-level filter column - ``document_id``), the embedding, and the serialised node (text + metadata) - as JSON. ``stores_text`` lets llama-index run off this store alone, with no - separate docstore or index store. - """ - - stores_text: bool = True - flat_metadata: bool = True - - _uri: str = PrivateAttr() - _table_name: str = PrivateAttr() - _conn: Any = PrivateAttr() - _table: Any = PrivateAttr() - - def __init__(self, uri: str, table_name: str = DEFAULT_TABLE_NAME) -> None: - super().__init__() - self._uri = uri - self._table_name = table_name - self._conn = lancedb.connect(uri) - existing = list(self._conn.table_names()) - self._table = ( - self._conn.open_table(table_name) if table_name in existing else None - ) - - @property - def client(self) -> Any: - return self._conn - - def table_exists(self) -> bool: - return self._table_name in list(self._conn.table_names()) - - def vector_dim(self) -> int | None: - if self._table is None: - return None - return self._table.schema.field("vector").type.list_size - - def drop_table(self) -> None: - if self.table_exists(): - self._conn.drop_table(self._table_name) - self._table = None - - @staticmethod - def _schema(dim: int) -> pa.Schema: - return pa.schema( - [ - pa.field("id", pa.string()), - pa.field("doc_id", pa.string()), - pa.field("document_id", pa.string()), - pa.field("vector", pa.list_(pa.float32(), dim)), - pa.field("node_content", pa.string()), - ], - ) - - def _row(self, node: BaseNode) -> dict[str, Any]: - meta = node_to_metadata_dict( - node, - remove_text=False, - flat_metadata=self.flat_metadata, - ) - return { - "id": node.node_id, -
"doc_id": node.ref_doc_id, - "document_id": str(node.metadata.get("document_id")), - "vector": node.get_embedding(), - "node_content": json.dumps(meta), - } - - def add(self, nodes: list[BaseNode], **add_kwargs: Any) -> list[str]: - if not nodes: - return [] - rows = [self._row(node) for node in nodes] - if self._table is None: - dim = len(nodes[0].get_embedding()) - self._table = self._conn.create_table( - self._table_name, - rows, - schema=self._schema(dim), - ) - else: - self._table.add(rows) - return [node.node_id for node in nodes] - - def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None: - if self._table is not None: - # SQL string literals take single quotes; double quotes denote identifiers. - self._table.delete(f"doc_id = '{_escape(ref_doc_id)}'") - - def delete_nodes( - self, - node_ids: list[str] | None = None, - filters: MetadataFilters | None = None, - **delete_kwargs: Any, - ) -> None: - if self._table is None: - return - if node_ids: - ids = ",".join(f"'{_escape(n)}'" for n in node_ids) - self._table.delete(f"id IN ({ids})") - elif filters is not None: - where = _build_where(filters) - if where: - self._table.delete(where) - - def _rows_to_nodes(self, rows: list[dict[str, Any]]) -> list[BaseNode]: - nodes: list[BaseNode] = [] - for row in rows: - node = metadata_dict_to_node(json.loads(row["node_content"])) - node.embedding = list(row["vector"]) - nodes.append(node) - return nodes - - def get_nodes( - self, - node_ids: list[str] | None = None, - filters: MetadataFilters | None = None, - **kwargs: Any, - ) -> list[BaseNode]: - if self._table is None: - return [] - query = self._table.search() - where = _build_where(filters) - if node_ids: - ids = ",".join(f"'{_escape(n)}'" for n in node_ids) - query = query.where(f"id IN ({ids})") - elif where: - query = query.where(where) - return self._rows_to_nodes(query.to_list()) - - def query( - self, - query: VectorStoreQuery, - **kwargs: Any, - ) -> VectorStoreQueryResult: - if self._table is None: - return VectorStoreQueryResult(nodes=[], similarities=[], ids=[]) - top_k = 
query.similarity_top_k or 10 - search = self._table.search(query.query_embedding).limit(top_k) - where = _build_where(query.filters) - if where: - search = search.where(where) - rows = search.to_list() - nodes = self._rows_to_nodes(rows) - # LanceDB returns squared-L2 distance; map to a descending similarity. - sims = [1.0 / (1.0 + float(row["_distance"])) for row in rows] - ids = [row["id"] for row in rows] - return VectorStoreQueryResult(nodes=nodes, similarities=sims, ids=ids) -``` - -- [ ] **Step 4: Run the tests to verify they pass** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py -v` -Expected: PASS (both tests). - -- [ ] **Step 5: Add delete / filter / get_nodes / fresh-process tests** - -Append to `src/paperless_ai/tests/test_vector_store.py` inside `TestPaperlessLanceVectorStoreCrud`: - -```python - def test_delete_removes_all_chunks_of_document( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add([_node("1-0", "1", "a", 0.1), _node("1-1", "1", "b", 0.2)]) - store.add([_node("2-0", "2", "c", 0.9)]) - - store.delete("1") - - assert store.client.open_table("documents").count_rows() == 1 - - def test_query_with_in_filter_scopes_results( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add([_node("1-0", "1", "a", 0.1), _node("2-0", "2", "b", 0.1)]) - - result = store.query( - VectorStoreQuery( - query_embedding=[0.1] * DIM, - similarity_top_k=5, - filters=MetadataFilters( - filters=[ - MetadataFilter( - key="document_id", - operator=FilterOperator.IN, - value=["2"], - ), - ], - ), - ), - ) - - assert [n.metadata["document_id"] for n in result.nodes] == ["2"] - - def test_get_nodes_filter_returns_empty_cleanly( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add([_node("1-0", "1", "a", 0.1)]) - nodes = store.get_nodes( - filters=MetadataFilters( - filters=[ - MetadataFilter( - key="document_id", - operator=FilterOperator.IN, - value=["999"], - ), - ], - ), - ) - assert nodes == [] - - def 
test_fresh_instance_filters_existing_table( - self, - tmp_path: Path, - ) -> None: - uri = str(tmp_path / "idx") - PaperlessLanceVectorStore(uri=uri).add( - [_node("1-0", "1", "a", 0.1), _node("2-0", "2", "b", 0.1)], - ) - - reopened = PaperlessLanceVectorStore(uri=uri) - result = reopened.query( - VectorStoreQuery( - query_embedding=[0.1] * DIM, - similarity_top_k=5, - filters=MetadataFilters( - filters=[ - MetadataFilter( - key="document_id", - operator=FilterOperator.IN, - value=["1"], - ), - ], - ), - ), - ) - assert [n.metadata["document_id"] for n in result.nodes] == ["1"] - - def test_table_exists_and_drop( - self, - store: PaperlessLanceVectorStore, - ) -> None: - assert store.table_exists() is False - store.add([_node("1-0", "1", "a", 0.1)]) - assert store.table_exists() is True - assert store.vector_dim() == DIM - store.drop_table() - assert store.table_exists() is False -``` - -- [ ] **Step 6: Run all adapter tests** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py -v` -Expected: PASS (all 7 tests). 
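The filter translation performed by `_build_where` can be illustrated without LanceDB or llama-index installed. The sketch below is a simplified stand-in, not the adapter code: `Op`, `Filter`, `escape`, and `build_where` are hypothetical names that mirror the Step 3 logic, and it assumes string literals are written with single quotes, with an embedded quote doubled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Op(Enum):
    """Hypothetical stand-in for llama-index's FilterOperator (EQ / IN only)."""

    EQ = "=="
    IN = "in"


@dataclass
class Filter:
    """Hypothetical stand-in for llama-index's MetadataFilter."""

    key: str
    value: Union[str, list[str]]
    operator: Op = Op.EQ


def escape(value: str) -> str:
    # A single quote inside a SQL string literal is written as two quotes.
    return str(value).replace("'", "''")


def build_where(filters: list[Filter], joiner: str = " AND ") -> Optional[str]:
    """Translate EQ / IN filters into a SQL predicate string."""
    if not filters:
        return None
    clauses: list[str] = []
    for f in filters:
        if f.operator is Op.IN:
            vals = ",".join(f"'{escape(v)}'" for v in f.value)
            clauses.append(f"{f.key} IN ({vals})")
        elif f.operator is Op.EQ:
            clauses.append(f"{f.key} = '{escape(f.value)}'")
        else:
            raise NotImplementedError(f"Unsupported operator: {f.operator}")
    return joiner.join(clauses)


if __name__ == "__main__":
    print(build_where([Filter("document_id", ["1", "2"], Op.IN)]))
    print(build_where([Filter("document_id", "o'brien")]))
```

Keeping the predicate builder a pure function makes it easy to unit-test the escaping rules separately from any table access.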
- -- [ ] **Step 7: Lint and commit** - -```bash -ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git commit -m "feat(ai): add LanceDB-backed vector store adapter - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 3: Atomic upsert (`upsert_document`) - -**Files:** - -- Modify: `src/paperless_ai/vector_store.py` -- Test: `src/paperless_ai/tests/test_vector_store.py` - -- [ ] **Step 1: Write the failing test for shrink-on-update pruning + single commit** - -Append a new class to `src/paperless_ai/tests/test_vector_store.py`: - -```python -class TestPaperlessLanceVectorStoreUpsert: - @pytest.fixture - def store(self, tmp_path: Path) -> PaperlessLanceVectorStore: - s = PaperlessLanceVectorStore(uri=str(tmp_path / "idx")) - s.add( - [ - _node("1-0", "1", "old0", 0.1), - _node("1-1", "1", "old1", 0.2), - _node("1-2", "1", "old2", 0.3), - _node("2-0", "2", "keep", 0.9), - ], - ) - return s - - def test_upsert_prunes_stale_chunks_and_keeps_others( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.upsert_document( - "1", - [_node("1-0", "1", "new0", 0.1), _node("1-1", "1", "new1", 0.2)], - ) - - table = store.client.open_table("documents") - doc1 = sorted( - r["id"] for r in table.search().where("document_id = '1'").to_list() - ) - assert doc1 == ["1-0", "1-1"] # 1-2 pruned - assert table.count_rows() == 3 # 2 new doc1 + 1 doc2 - - def test_upsert_is_single_commit( - self, - store: PaperlessLanceVectorStore, - ) -> None: - table = store.client.open_table("documents") - before = table.version - store.upsert_document("1", [_node("1-0", "1", "new0", 0.1)]) - assert store.client.open_table("documents").version == before + 1 -``` - -- [ ] **Step 2: Run to verify it fails** - -Run: `uv run pytest 
src/paperless_ai/tests/test_vector_store.py::TestPaperlessLanceVectorStoreUpsert -v` -Expected: FAIL with `AttributeError: 'PaperlessLanceVectorStore' object has no attribute 'upsert_document'`. - -- [ ] **Step 3: Implement `upsert_document`** - -Add to `PaperlessLanceVectorStore` in `src/paperless_ai/vector_store.py`, after `add`: - -```python - def upsert_document(self, document_id: str, nodes: list[BaseNode]) -> list[str]: - """Atomically replace all stored chunks of ``document_id`` with ``nodes``. - - A single ``merge_insert`` commit: matching node ids are updated, new ids - inserted, and any existing rows for this document that are not in the new - set are deleted (``when_not_matched_by_source_delete``). This prunes stale - trailing chunks when an edit reduces a document's chunk count, with no - transient empty state for concurrent lock-free readers. - """ - if not nodes: - # No indexable content: treat as a removal. - self.delete(document_id) - return [] - rows = [self._row(node) for node in nodes] - if self._table is None: - dim = len(nodes[0].get_embedding()) - self._table = self._conn.create_table( - self._table_name, - rows, - schema=self._schema(dim), - ) - return [node.node_id for node in nodes] - ( - self._table.merge_insert("id") - .when_matched_update_all() - .when_not_matched_insert_all() - .when_not_matched_by_source_delete( - f"document_id = '{_escape(document_id)}'", - ) - .execute(rows) - ) - return [node.node_id for node in nodes] -``` - -- [ ] **Step 4: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py::TestPaperlessLanceVectorStoreUpsert -v` -Expected: PASS (both tests). 
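The row-level effect of the three `merge_insert` clauses used by `upsert_document` can be modelled in plain Python. This is only a sketch of the semantics described in the docstring above, assuming rows are keyed on `id`; it does not model LanceDB's atomic commit, and the `merge_insert` function and `Row` alias here are hypothetical.

```python
from typing import Callable

Row = dict[str, str]


def merge_insert(
    table: dict[str, Row],
    new_rows: list[Row],
    delete_unmatched_if: Callable[[Row], bool],
) -> dict[str, Row]:
    """Pure-Python model of the upsert semantics, keyed on ``id``."""
    incoming = {row["id"]: row for row in new_rows}
    result: dict[str, Row] = {}
    for row_id, row in table.items():
        if row_id in incoming:
            continue  # replaced below, like when_matched_update_all
        if delete_unmatched_if(row):
            continue  # stale chunk, like when_not_matched_by_source_delete
        result[row_id] = row  # rows of other documents are left untouched
    result.update(incoming)  # updated rows plus when_not_matched_insert_all
    return result


if __name__ == "__main__":
    table = {
        "1-0": {"id": "1-0", "document_id": "1", "text": "old0"},
        "1-1": {"id": "1-1", "document_id": "1", "text": "old1"},
        "1-2": {"id": "1-2", "document_id": "1", "text": "old2"},
        "2-0": {"id": "2-0", "document_id": "2", "text": "keep"},
    }
    new_rows = [
        {"id": "1-0", "document_id": "1", "text": "new0"},
        {"id": "1-1", "document_id": "1", "text": "new1"},
    ]
    merged = merge_insert(table, new_rows, lambda r: r["document_id"] == "1")
    print(sorted(merged))
```

Scoping the delete predicate to the document being upserted is what keeps rows of other documents (here `2-0`) safe while pruning the trailing chunk `1-2`.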
- -- [ ] **Step 5: Lint and commit** - -```bash -ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git commit -m "feat(ai): atomic upsert_document on the LanceDB store - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 4: ANN index threshold, scalar index, and compaction - -**Files:** - -- Modify: `src/paperless_ai/vector_store.py` -- Test: `src/paperless_ai/tests/test_vector_store.py` - -- [ ] **Step 1: Write the failing tests** - -Append a new class to `src/paperless_ai/tests/test_vector_store.py`: - -```python -class TestPaperlessLanceVectorStoreMaintenance: - @pytest.fixture - def store(self, tmp_path: Path) -> PaperlessLanceVectorStore: - return PaperlessLanceVectorStore(uri=str(tmp_path / "idx")) - - def test_maybe_create_ann_index_noop_below_threshold( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add([_node("1-0", "1", "a", 0.1)]) - # Threshold far above row count -> no index attempted, no error. - store.maybe_create_ann_index(min_rows=1000) - # Still queryable. - result = store.query( - VectorStoreQuery(query_embedding=[0.1] * DIM, similarity_top_k=1), - ) - assert len(result.nodes) == 1 - - def test_maybe_create_ann_index_non_divisible_dim_falls_back( - self, - store: PaperlessLanceVectorStore, - ) -> None: - # DIM=8 is not divisible by the PQ default sub-vectors; must not raise - # and must leave the table queryable (IVF_FLAT fallback or skipped). 
- for i in range(40): - store.add([_node(f"1-{i}", "1", f"t{i}", float(i))]) - store.maybe_create_ann_index(min_rows=10) - result = store.query( - VectorStoreQuery(query_embedding=[1.0] * DIM, similarity_top_k=3), - ) - assert len(result.nodes) == 3 - - def test_compact_reduces_to_single_version( - self, - store: PaperlessLanceVectorStore, - ) -> None: - for i in range(5): - store.add([_node(f"1-{i}", "1", f"t{i}", float(i))]) - assert len(store.client.open_table("documents").list_versions()) > 1 - store.compact(retention_seconds=0) - assert len(store.client.open_table("documents").list_versions()) == 1 -``` - -- [ ] **Step 2: Run to verify they fail** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py::TestPaperlessLanceVectorStoreMaintenance -v` -Expected: FAIL (`maybe_create_ann_index` / `compact` not defined). - -- [ ] **Step 3: Implement maintenance methods** - -Add to the top of `src/paperless_ai/vector_store.py` (module constants, after `DEFAULT_TABLE_NAME`): - -```python -# Below this many chunks, LanceDB's exact (brute-force) search is sufficient and -# faster than building an ANN index (per LanceDB guidance, ~100K vectors). -ANN_INDEX_MIN_ROWS = 100_000 -# IVF_PQ default; num_sub_vectors must evenly divide the embedding dimension. -ANN_PQ_SUB_VECTORS = 96 -``` - -Add these methods to `PaperlessLanceVectorStore`: - -```python - def _has_vector_index(self) -> bool: - try: - return any( - "vector" in (getattr(idx, "columns", []) or []) - for idx in self._table.list_indices() - ) - except Exception: # pragma: no cover - older lancedb without list_indices - return False - - def maybe_create_ann_index(self, min_rows: int = ANN_INDEX_MIN_ROWS) -> None: - """Best-effort: build an IVF index once the table is large enough. - - IVF_PQ is used when ``num_sub_vectors`` divides the embedding dimension, - otherwise IVF_FLAT (no divisor constraint). Any failure is logged and - leaves the table on exact search, which is always correct. 
- """ - if self._table is None: - return - rows = self._table.count_rows() - if rows < min_rows or self._has_vector_index(): - return - num_partitions = max(1, rows // 4096) - # Embedding dim from the schema's fixed-size list column. - dim = self._table.schema.field("vector").type.list_size - try: - if dim % ANN_PQ_SUB_VECTORS == 0: - self._table.create_index( - metric="l2", - num_partitions=num_partitions, - num_sub_vectors=ANN_PQ_SUB_VECTORS, - index_type="IVF_PQ", - ) - else: - self._table.create_index( - metric="l2", - num_partitions=num_partitions, - index_type="IVF_FLAT", - ) - except Exception as e: # pragma: no cover - depends on data/dim - logger.warning("Skipping ANN index creation: %s", e) - - def ensure_document_id_scalar_index(self) -> None: - """Create a scalar index on the filter column (never on the merge key - ``id`` — see LanceDB #3177).""" - if self._table is None: - return - try: - self._table.create_scalar_index("document_id", replace=True) - except Exception as e: # pragma: no cover - logger.warning("Skipping document_id scalar index: %s", e) - - def compact(self, retention_seconds: int) -> None: - """Compact fragments and prune old MVCC versions in one call.""" - if self._table is None: - return - from datetime import timedelta - - self._table.optimize(cleanup_older_than=timedelta(seconds=retention_seconds)) -``` - -> **Note for the implementer:** verify `list_size` is the right attribute for a `pyarrow` fixed-size list on the installed pyarrow (`pa.list_(pa.float32(), 8).list_size == 8`). If the installed pyarrow exposes it differently, adjust the accessor accordingly (this same accessor is used by `vector_dim()` in Task 2). - -- [ ] **Step 4: Run to verify they pass** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py::TestPaperlessLanceVectorStoreMaintenance -v` -Expected: PASS (all three). 
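The decision logic inside `maybe_create_ann_index` can be isolated as a pure function, which makes the threshold and divisibility rules easy to check by hand. This is a hedged sketch: `choose_index` is a hypothetical helper, it reuses the constant values from Step 3, and it deliberately ignores the existing-index check and the log-and-continue failure path.

```python
from typing import Optional

ANN_INDEX_MIN_ROWS = 100_000  # same value as the module constant in Step 3
ANN_PQ_SUB_VECTORS = 96  # must evenly divide the embedding dimension for IVF_PQ


def choose_index(
    rows: int,
    dim: int,
    min_rows: int = ANN_INDEX_MIN_ROWS,
) -> Optional[tuple[str, int]]:
    """Return (index_type, num_partitions), or None to stay on exact search."""
    if rows < min_rows:
        return None
    num_partitions = max(1, rows // 4096)
    index_type = "IVF_PQ" if dim % ANN_PQ_SUB_VECTORS == 0 else "IVF_FLAT"
    return index_type, num_partitions


if __name__ == "__main__":
    print(choose_index(40, 8))  # below the default threshold
    print(choose_index(40, 8, min_rows=10))  # 8 is not divisible by 96
    print(choose_index(200_000, 768))  # 768 = 96 * 8
```

For example, a 768-dimensional model qualifies for IVF_PQ because 768 is a multiple of 96, while a 1024-dimensional model would fall back to IVF_FLAT.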
- -- [ ] **Step 5: Add the upsert-after-optimize regression test (#3177 guard)** - -Append to `TestPaperlessLanceVectorStoreMaintenance`: - -```python - def test_upsert_after_optimize_with_scalar_index( - self, - store: PaperlessLanceVectorStore, - ) -> None: - store.add( - [ - _node("1-0", "1", "old0", 0.1), - _node("1-1", "1", "old1", 0.2), - _node("1-2", "1", "old2", 0.3), - _node("2-0", "2", "keep", 0.9), - ], - ) - store.ensure_document_id_scalar_index() - store.compact(retention_seconds=0) - - store.upsert_document("1", [_node("1-0", "1", "new0", 0.1)]) - - table = store.client.open_table("documents") - doc1 = sorted( - r["id"] for r in table.search().where("document_id = '1'").to_list() - ) - assert doc1 == ["1-0"] - assert table.count_rows() == 2 -``` - -- [ ] **Step 6: Run the full maintenance class** - -Run: `uv run pytest src/paperless_ai/tests/test_vector_store.py::TestPaperlessLanceVectorStoreMaintenance -v` -Expected: PASS (all four). - -- [ ] **Step 7: Lint and commit** - -```bash -ruff check src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -ruff format src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git add src/paperless_ai/vector_store.py src/paperless_ai/tests/test_vector_store.py -git commit -m "feat(ai): ANN index threshold, scalar index, and compaction - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 5: Node identity — set `LlamaDocument.id_` to the document id - -**Files:** - -- Modify: `src/paperless_ai/indexing.py` (the `build_document_node` function, around `indexing.py:132-149`) -- Test: `src/paperless_ai/tests/test_ai_indexing.py` - -- [ ] **Step 1: Write the failing test** - -Add to `src/paperless_ai/tests/test_ai_indexing.py`: - -```python -@pytest.mark.django_db -def test_build_document_node_sets_ref_doc_id(real_document) -> None: - nodes = indexing.build_document_node(real_document) - assert nodes - for node in nodes: - assert node.ref_doc_id == 
str(real_document.id) -``` - -- [ ] **Step 2: Run to verify it fails** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_sets_ref_doc_id -v` -Expected: FAIL, because `ref_doc_id` is a random uuid, not `str(real_document.id)`. - -- [ ] **Step 3: Set the LlamaDocument id** - -In `src/paperless_ai/indexing.py`, in `build_document_node`, change the `LlamaDocument(...)` construction to set `id_`: - -```python - doc = LlamaDocument( - id_=str(document.id), - text=text, - metadata=metadata, - excluded_embed_metadata_keys=list(metadata.keys()), - ) -``` - -- [ ] **Step 4: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_build_document_node_sets_ref_doc_id -v` -Expected: PASS. - -- [ ] **Step 5: Commit** - -```bash -git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py -git commit -m "feat(ai): tie LlamaDocument id to the paperless document id - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 6: Vector-store factory + load/build in `indexing.py` - -**Files:** - -- Modify: `src/paperless_ai/indexing.py`: replace `get_or_create_storage_context` and `load_or_build_index`; add `get_vector_store` and `LLM_INDEX_TABLE`. -- Test: `src/paperless_ai/tests/test_ai_indexing.py` - -- [ ] **Step 1: Write the failing test** - -Add to `src/paperless_ai/tests/test_ai_indexing.py`: - -```python -@pytest.mark.django_db -def test_get_vector_store_roundtrip( - temp_llm_index_dir, - mock_embed_model, -) -> None: - from paperless_ai.vector_store import PaperlessLanceVectorStore - - store = indexing.get_vector_store() - assert isinstance(store, PaperlessLanceVectorStore) -``` - -- [ ] **Step 2: Run to verify it fails** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_get_vector_store_roundtrip -v` -Expected: FAIL, because `indexing.get_vector_store` is not defined. 
- -- [ ] **Step 3: Add the factory and rewrite the index builder** - -In `src/paperless_ai/indexing.py`: - -1. Add near the top (after `logger = ...`). **Do not** add a top-level `import` of `vector_store` — only a constant and a `TYPE_CHECKING`-only hint (see the lazy-import constraint): - -```python -LLM_INDEX_TABLE = "documents" -``` - -There is already a `from typing import TYPE_CHECKING` block at the top of `indexing.py`; add the adapter to it for type hints only: - -```python -if TYPE_CHECKING: - from paperless_ai.vector_store import PaperlessLanceVectorStore -``` - -2. Replace the entire `get_or_create_storage_context(...)` function with (note the **function-local** import of `vector_store` and the string type hint): - -```python -def get_vector_store() -> "PaperlessLanceVectorStore": - """Open (or lazily create) the LanceDB-backed vector store. - - Imports ``vector_store`` lazily so that importing ``indexing`` (which - ``documents.tasks`` does at module top) never drags in lancedb/llama_index. - """ - from paperless_ai.vector_store import PaperlessLanceVectorStore - - settings.LLM_INDEX_DIR.mkdir(parents=True, exist_ok=True) - return PaperlessLanceVectorStore( - uri=str(settings.LLM_INDEX_DIR), - table_name=LLM_INDEX_TABLE, - ) -``` - -3. Replace `load_or_build_index(...)` with: - -```python -def load_or_build_index(nodes=None): - """Load the VectorStoreIndex backed by the LanceDB store. - - With ``stores_text=True`` the index runs off the vector store alone — no - docstore or index store. ``nodes`` is accepted for signature compatibility - but unused; the store is the source of truth. - """ - import llama_index.core.settings as llama_settings - from llama_index.core import VectorStoreIndex - - embed_model = get_embedding_model() - llama_settings.Settings.embed_model = embed_model - vector_store = get_vector_store() - return VectorStoreIndex.from_vector_store( - vector_store=vector_store, - embed_model=embed_model, - ) -``` - -4. 
Replace `vector_store_file_exists()` (it must use the store before Task 7 relies on it): - -```python -def vector_store_file_exists() -> bool: - """True when the LanceDB table exists.""" - return get_vector_store().table_exists() -``` - -5. Remove the now-unused imports for `StorageContext`, `SimpleDocumentStore`, `SimpleIndexStore`, and `faiss`. **Keep `shutil`** — it is used by Task 9's migration cleanup. - -- [ ] **Step 4: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_get_vector_store_roundtrip -v` -Expected: PASS. - -- [ ] **Step 5: Write the lazy-import guard test** - -This guards the hard constraint: importing `documents.tasks` (the light path that management commands traverse) must not pull in any AI library. It runs in a **subprocess** because the pytest process has already imported these libs via other tests. - -Create `src/paperless_ai/tests/test_lazy_imports.py`: - -```python -import subprocess -import sys - - -class TestLazyAiImports: - def test_importing_tasks_does_not_load_ai_libraries(self) -> None: - code = ( - "import os, django, sys\n" - "os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'paperless.settings')\n" - "django.setup()\n" - "import documents.tasks # noqa: F401\n" - "leaked = [m for m in ('lancedb', 'pyarrow', 'llama_index') " - "if m in sys.modules]\n" - "assert not leaked, f'AI libraries leaked into the light path: {leaked}'\n" - ) - result = subprocess.run( - [sys.executable, "-c", code], - capture_output=True, - text=True, - cwd="src", - ) - assert result.returncode == 0, result.stdout + result.stderr -``` - -- [ ] **Step 6: Run the guard test** - -Run: `uv run pytest src/paperless_ai/tests/test_lazy_imports.py -v` -Expected: PASS. If it FAILS, find the offending top-level import (`git grep -n "^from llama_index\|^import lancedb\|^import pyarrow\|^from paperless_ai.vector_store" src/paperless_ai src/documents`) and make it function-local. 
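The guard test relies on a basic property of Python imports: a function-local `import` executes only when the function is first called, so defining the function loads nothing. The sketch below shows that property with the standard library alone. `colorsys` is only a stand-in for a heavy AI library, and `get_store` and `lazy_import_holds` are hypothetical names that do not exist in the codebase.

```python
import subprocess
import sys

# Script run in a fresh interpreter so that this process's own imports
# cannot mask the result. "colorsys" stands in for a heavy AI library.
PROBE = """
import sys

def get_store():
    import colorsys  # function-local, like the vector_store import
    return colorsys

assert "colorsys" not in sys.modules, "imported too early"
get_store()
assert "colorsys" in sys.modules, "lazy import never ran"
"""


def lazy_import_holds() -> bool:
    """Return True when the stand-in module loads only on first call."""
    result = subprocess.run(
        [sys.executable, "-c", PROBE],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode == 0
```

Running the probe in a subprocess mirrors the guard test above: a long-lived test process may already hold the module in `sys.modules`, which would hide a regression.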
- -- [ ] **Step 7: Commit** - -```bash -git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py src/paperless_ai/tests/test_lazy_imports.py -git commit -m "refactor(ai): build the index from the LanceDB store alone (lazy import) - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 7: Rewire add / update / remove / rebuild - -**Files:** - -- Modify: `src/paperless_ai/indexing.py` — `llm_index_add_or_update_document`, `llm_index_remove_document`, `update_llm_index`; delete `remove_document_docstore_nodes`. -- Test: `src/paperless_ai/tests/test_ai_indexing.py` - -- [ ] **Step 1: Write the failing tests (CRUD against the real store)** - -Add to `src/paperless_ai/tests/test_ai_indexing.py`: - -```python -@pytest.mark.django_db -def test_add_then_remove_document( - temp_llm_index_dir, - mock_embed_model, - real_document, -) -> None: - indexing.llm_index_add_or_update_document(real_document) - store = indexing.get_vector_store() - table = store.client.open_table(indexing.LLM_INDEX_TABLE) - assert table.count_rows() >= 1 - - indexing.llm_index_remove_document(real_document) - assert store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows() == 0 - - -@pytest.mark.django_db -def test_update_shrinks_chunks_without_orphans( - temp_llm_index_dir, - mock_embed_model, - real_document, -) -> None: - real_document.content = "word " * 4000 # many chunks - real_document.save() - indexing.llm_index_add_or_update_document(real_document) - store = indexing.get_vector_store() - big = store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows() - - real_document.content = "short" # one chunk - real_document.save() - indexing.llm_index_add_or_update_document(real_document) - - rows = store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows() - assert rows < big - assert rows >= 1 -``` - -- [ ] **Step 2: Run to verify they fail** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py -k "add_then_remove or shrinks" -v` 
-Expected: FAIL (current implementation still uses the docstore path). - -- [ ] **Step 3: Rewrite the functions** - -In `src/paperless_ai/indexing.py` (per the lazy-import constraint, **do not** add a top-level `llama_index` import — `MetadataMode` is imported inside each function that uses it): - -1. **Delete** the `remove_document_docstore_nodes(...)` function entirely. - -2. Replace `llm_index_add_or_update_document`: - -```python -def llm_index_add_or_update_document(document: Document): - """Add or atomically replace a document's chunks in the LLM index.""" - from llama_index.core.schema import MetadataMode - - new_nodes = build_document_node(document, chunk_size=get_rag_chunk_size()) - - embed_model = get_embedding_model() - for node in new_nodes: - node.embedding = embed_model.get_text_embedding( - node.get_content(metadata_mode=MetadataMode.EMBED), - ) - - with FileLock(_index_lock_path()): - store = get_vector_store() - store.upsert_document(str(document.id), new_nodes) - store.ensure_document_id_scalar_index() -``` - -> Note: `upsert_document` with an empty `new_nodes` list deletes the document (handles the "no indexable content" case the old code logged-and-skipped). - -3. Replace `llm_index_remove_document`: - -```python -def llm_index_remove_document(document: Document): - """Remove a document's chunks from the LLM index.""" - with FileLock(_index_lock_path()): - store = get_vector_store() - store.delete(str(document.id)) -``` - -4. Rewrite `update_llm_index` — both the rebuild and incremental branches. The rebuild path drops/recreates the table and bulk-inserts; the incremental path upserts changed documents (compare `modified`). 
Replace the function body with: - -```python -def update_llm_index( - *, - iter_wrapper: IterWrapper[Document] = identity, - rebuild=False, -) -> str: - """Rebuild or incrementally update the LLM index.""" - from llama_index.core.schema import MetadataMode - - documents = Document.objects.all() - if not documents.exists(): - logger.warning("No documents found to index.") - if not rebuild and not vector_store_file_exists(): - return "No documents found to index." - - chunk_size = AIConfig().llm_embedding_chunk_size - embed_model = get_embedding_model() - - with FileLock(_index_lock_path()): - if rebuild or not vector_store_file_exists(): - (settings.LLM_INDEX_DIR / "meta.json").unlink(missing_ok=True) - logger.info("Rebuilding LLM index.") - store = get_vector_store() - store.drop_table() # defined in Task 2; bulk-insert into a fresh table - for document in iter_wrapper(documents): - nodes = build_document_node(document, chunk_size=chunk_size) - for node in nodes: - node.embedding = embed_model.get_text_embedding( - node.get_content(metadata_mode=MetadataMode.EMBED), - ) - store.add(nodes) - msg = "LLM index rebuilt successfully." - else: - store = get_vector_store() - existing = { - str(row["document_id"]): json.loads(row["node_content"]) - for row in _iter_existing_modified(store) - } - changed = 0 - for document in iter_wrapper(documents): - doc_id = str(document.id) - node_meta = existing.get(doc_id) - if node_meta is not None: - stored_modified = node_meta.get("modified") - if stored_modified == document.modified.isoformat(): - continue - nodes = build_document_node(document, chunk_size=chunk_size) - for node in nodes: - node.embedding = embed_model.get_text_embedding( - node.get_content(metadata_mode=MetadataMode.EMBED), - ) - store.upsert_document(doc_id, nodes) - changed += 1 - msg = ( - "LLM index updated successfully." - if changed - else "No changes detected in LLM index." 
- ) - - store.ensure_document_id_scalar_index() - store.maybe_create_ann_index() - store.compact(retention_seconds=get_llm_index_compaction_retention()) - return msg -``` - -5. Add the helpers used above near the other small helpers in `indexing.py`: - -```python -def _iter_existing_modified(store) -> list[dict]: - """One representative row per document_id, for modified-time comparison.""" - table_name = LLM_INDEX_TABLE - if table_name not in store.client.table_names(): - return [] - seen: dict[str, dict] = {} - for row in store.client.open_table(table_name).search().to_list(): - seen.setdefault(str(row["document_id"]), row) - return list(seen.values()) - - -def get_llm_index_compaction_retention() -> int: - """Seconds of MVCC version history to keep during compaction.""" - return 60 * 60 # 1 hour: safe for in-flight readers, reclaims daily -``` - -6. Ensure `import json` is present at the top of `indexing.py` (`update_llm_index` uses it to decode the `node_content` of the rows returned by `_iter_existing_modified`). - -- [ ] **Step 4: Run to verify they pass** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py -k "add_then_remove or shrinks" -v` -Expected: PASS. - -- [ ] **Step 5: Run the full indexing test module** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py -v` -Expected: PASS, except tests that asserted on the old docstore internals — fix or delete those in Task 11.
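The incremental branch decides per document whether to re-embed by comparing the stored `modified` string with the document's current timestamp. A minimal sketch of that decision in isolation, assuming the stored metadata is a plain dict with a `modified` key; `needs_reindex` is a hypothetical helper, not a function in `indexing.py`:

```python
from datetime import datetime, timezone
from typing import Optional


def needs_reindex(stored_meta: Optional[dict], modified: datetime) -> bool:
    """Decide whether a document's chunks must be re-embedded.

    stored_meta is the decoded metadata of one representative chunk, or
    None when the document has never been indexed.
    """
    if stored_meta is None:
        return True  # new document: always index
    # A missing or different timestamp means the stored chunks are stale.
    return stored_meta.get("modified") != modified.isoformat()


ts = datetime(2026, 6, 2, 12, 0, tzinfo=timezone.utc)
unchanged = needs_reindex({"modified": ts.isoformat()}, ts)
never_indexed = needs_reindex(None, ts)
edited = needs_reindex({"modified": "2026-06-01T00:00:00+00:00"}, ts)
```

Because the comparison is an exact string match, the value written at index time and the value compared later must both come from the same `isoformat()` call on the same field. Any difference in timezone or precision would make every document look changed.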
- -- [ ] **Step 6: Commit** - -```bash -git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py -git commit -m "refactor(ai): add/update/remove/rebuild via LanceDB upsert + delete - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 8: `query_similar_documents` via metadata filter - -**Files:** - -- Modify: `src/paperless_ai/indexing.py` — `query_similar_documents` (around `indexing.py:394-463`) -- Test: `src/paperless_ai/tests/test_ai_indexing.py` - -- [ ] **Step 1: Write the failing test** - -Add to `src/paperless_ai/tests/test_ai_indexing.py`: - -```python -@pytest.mark.django_db -def test_query_similar_documents_respects_allowed_ids( - temp_llm_index_dir, - mock_embed_model, -) -> None: - from documents.tests.factories import DocumentFactory - - a = DocumentFactory.create(content="alpha shared content here") - b = DocumentFactory.create(content="beta shared content here") - c = DocumentFactory.create(content="gamma shared content here") - for doc in (a, b, c): - indexing.llm_index_add_or_update_document(doc) - - results = indexing.query_similar_documents(a, document_ids=[b.id]) - - assert all(doc.id == b.id for doc in results) -``` - -- [ ] **Step 2: Run to verify it fails** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_query_similar_documents_respects_allowed_ids -v` -Expected: FAIL (current implementation scans the docstore via `index.docstore.docs`, which no longer exists). 
- -- [ ] **Step 3: Rewrite `query_similar_documents`** - -Replace the function body in `src/paperless_ai/indexing.py`: - -```python -def query_similar_documents( - document: Document, - top_k: int = 5, - document_ids: Iterable[int | str] | None = None, -) -> list[Document]: - """Return up to ``top_k`` Documents most similar to ``document``.""" - allowed_document_ids = normalize_document_ids(document_ids) - if allowed_document_ids is not None and not allowed_document_ids: - return [] - - if not vector_store_file_exists(): - queue_llm_index_update_if_needed( - rebuild=False, - reason="LLM index not found for similarity query.", - ) - return [] - - from llama_index.core.retrievers import VectorIndexRetriever - from llama_index.core.vector_stores.types import FilterOperator - from llama_index.core.vector_stores.types import MetadataFilter - from llama_index.core.vector_stores.types import MetadataFilters - - index = load_or_build_index() - - filters = None - if allowed_document_ids is not None: - filters = MetadataFilters( - filters=[ - MetadataFilter( - key="document_id", - operator=FilterOperator.IN, - value=sorted(allowed_document_ids), - ), - ], - ) - - retriever = VectorIndexRetriever( - index=index, - similarity_top_k=top_k, - filters=filters, - ) - - config = AIConfig() - query_text = truncate_content( - (document.title or "") + "\n" + (document.content or ""), - chunk_size=config.llm_embedding_chunk_size, - context_size=config.llm_context_size, - ) - results = retriever.retrieve(query_text) - - retrieved_document_ids: list[int] = [] - for node in results: - document_id = node.metadata.get("document_id") - if document_id is None: - continue - normalized = str(document_id) - if allowed_document_ids is not None and normalized not in allowed_document_ids: - continue - try: - retrieved_document_ids.append(int(normalized)) - except ValueError: - logger.warning( - "Skipping LLM index result with invalid document_id %r.", - document_id, - ) - - return 
list(Document.objects.filter(pk__in=retrieved_document_ids)) -``` - -- [ ] **Step 4: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_query_similar_documents_respects_allowed_ids -v` -Expected: PASS. - -- [ ] **Step 5: Commit** - -```bash -git add src/paperless_ai/indexing.py src/paperless_ai/tests/test_ai_indexing.py -git commit -m "refactor(ai): query_similar_documents via metadata filter - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 9: Dimension guard, migration cleanup, and embedding helper - -> Note: the store primitives `table_exists` / `vector_dim` / `drop_table` were added in Task 2, and `vector_store_file_exists()` was rewritten in Task 6. This task adds only the indexing-level migration and dimension-mismatch guard plus the embedding helper they need. - -**Files:** - -- Modify: `src/paperless_ai/indexing.py` — add migration cleanup + dimension-mismatch guard; wire them into `update_llm_index`. -- Modify: `src/paperless_ai/embedding.py` — add `current_embedding_dim`. -- Test: `src/paperless_ai/tests/test_ai_indexing.py` - -- [ ] **Step 1: Write the failing test** - -Add to `test_ai_indexing.py`: - -```python -@pytest.mark.django_db -def test_migration_wipes_stale_faiss_files(temp_llm_index_dir) -> None: - stale = temp_llm_index_dir / "default__vector_store.json" - stale.write_text("{}") - indexing.migrate_stale_faiss_index() - assert not stale.exists() -``` - -- [ ] **Step 2: Run to verify it fails** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_migration_wipes_stale_faiss_files -v` -Expected: FAIL (`indexing.migrate_stale_faiss_index` not defined). 
- -- [ ] **Step 3: Add the embedding helper** - -In `src/paperless_ai/embedding.py`, add: - -```python -def current_embedding_dim() -> int: - """Embedding dimension for the configured model (probes if not cached).""" - return get_embedding_dim() -``` - -- [ ] **Step 4: Add migration cleanup + dimension guard** - -In `src/paperless_ai/indexing.py` (note: `vector_store_file_exists` was already rewritten in Task 6 — do not redefine it): - -```python -def migrate_stale_faiss_index() -> None: - """Remove a pre-LanceDB FAISS index directory so it is rebuilt fresh.""" - stale_marker = settings.LLM_INDEX_DIR / "default__vector_store.json" - if stale_marker.exists(): - logger.info("Removing stale FAISS LLM index; it will be rebuilt.") - shutil.rmtree(settings.LLM_INDEX_DIR, ignore_errors=True) - settings.LLM_INDEX_DIR.mkdir(parents=True, exist_ok=True) - - -def embedding_dim_mismatch() -> bool: - """True when the stored table's vector dim differs from the current model.""" - store = get_vector_store() - stored = store.vector_dim() - if stored is None: - return False - from paperless_ai.embedding import current_embedding_dim - - return stored != current_embedding_dim() -``` - -Then wire them into `update_llm_index` — add this at the very top of the function body, **before** the `with FileLock(...)` block (Task 7's rewrite as written does not call `migrate_stale_faiss_index`; if your version already does, make it match this block rather than calling it twice): - -```python - migrate_stale_faiss_index() - if not rebuild and vector_store_file_exists() and embedding_dim_mismatch(): - logger.warning("Embedding dimension changed; forcing LLM index rebuild.") - rebuild = True -``` - -- [ ] **Step 5: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_ai_indexing.py::test_migration_wipes_stale_faiss_files -v` -Expected: PASS.
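Since the migration call runs on every `update_llm_index` invocation, it has to be a no-op once the legacy marker is gone. The sketch below exercises that idempotence against a temporary directory using only the standard library. `wipe_if_stale` is a hypothetical stand-in for `migrate_stale_faiss_index` that takes the directory as a parameter instead of reading Django settings, and `docstore.json` is just an example sibling file.

```python
import shutil
import tempfile
from pathlib import Path

STALE_MARKER = "default__vector_store.json"  # marker file of the old layout


def wipe_if_stale(index_dir: Path) -> bool:
    """Remove a legacy index directory; return True when a wipe happened."""
    if not (index_dir / STALE_MARKER).exists():
        return False
    shutil.rmtree(index_dir, ignore_errors=True)
    index_dir.mkdir(parents=True, exist_ok=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "llm_index"
    index_dir.mkdir()
    (index_dir / STALE_MARKER).write_text("{}")
    (index_dir / "docstore.json").write_text("{}")  # example sibling file

    first = wipe_if_stale(index_dir)  # legacy files present: wiped
    leftover = list(index_dir.iterdir())
    second = wipe_if_stale(index_dir)  # already clean: no-op
```

Returning a boolean is a choice made for this sketch only; the plan's function returns `None` and logs instead.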
- -- [ ] **Step 6: Commit** - -```bash -git add src/paperless_ai/indexing.py src/paperless_ai/embedding.py src/paperless_ai/tests/test_ai_indexing.py -git commit -m "feat(ai): dimension guard and FAISS index migration - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 10: Chat — stock retriever with filters - -**Files:** - -- Modify: `src/paperless_ai/chat.py` — delete `_get_document_filtered_retriever`; rewrite `_stream_chat_with_documents`. -- Test: `src/paperless_ai/tests/test_chat.py` - -- [ ] **Step 1: Read the existing chat tests** - -Run: `uv run pytest src/paperless_ai/tests/test_chat.py -v` -Expected: baseline of current passes (note which tests reference `_get_document_filtered_retriever` or FAISS internals; those will be updated). - -- [ ] **Step 2: Write/adjust the failing test** - -Add to `src/paperless_ai/tests/test_chat.py` (a class-grouped, mocker-based test): - -```python -import pytest - -from paperless_ai import chat - - -@pytest.mark.django_db -class TestStreamChatRetrieval: - def test_no_nodes_yields_no_content_message( - self, - temp_llm_index_dir, - mock_embed_model, - mocker, - ) -> None: - from documents.tests.factories import DocumentFactory - - doc = DocumentFactory.create(content="hello world") - # Nothing indexed for this document yet. - out = list(chat.stream_chat_with_documents("question?", [doc])) - assert chat.CHAT_NO_CONTENT_MESSAGE in out -``` - -(`mock_embed_model` is the fixture in `test_ai_indexing.py`; move it into `conftest.py` in Step 4 so both modules can use it.) - -- [ ] **Step 3: Run to verify it fails or errors on the docstore reach-in** - -Run: `uv run pytest src/paperless_ai/tests/test_chat.py::TestStreamChatRetrieval -v` -Expected: FAIL/ERROR — current `_stream_chat_with_documents` reads `index.docstore.docs`, which no longer exists. 
- -- [ ] **Step 4: Move `mock_embed_model` + `FakeEmbedding` to conftest** - -Cut the `FakeEmbedding` class and `mock_embed_model` fixture from `test_ai_indexing.py` and paste them into `src/paperless_ai/tests/conftest.py` (so both test modules share them). Leave `temp_llm_index_dir` as-is. - -- [ ] **Step 5: Rewrite chat** - -In `src/paperless_ai/chat.py`: - -1. **Delete** `_get_document_filtered_retriever(...)` entirely. - -2. Rewrite `_stream_chat_with_documents`: - -```python -def _stream_chat_with_documents(query_str: str, documents: list[Document]): - from llama_index.core.prompts import PromptTemplate - from llama_index.core.query_engine import RetrieverQueryEngine - from llama_index.core.response_synthesizers import get_response_synthesizer - from llama_index.core.retrievers import VectorIndexRetriever - from llama_index.core.vector_stores.types import FilterOperator - from llama_index.core.vector_stores.types import MetadataFilter - from llama_index.core.vector_stores.types import MetadataFilters - - client = AIClient() - index = load_or_build_index() - - doc_ids = [str(doc.pk) for doc in documents] - filters = MetadataFilters( - filters=[ - MetadataFilter( - key="document_id", - operator=FilterOperator.IN, - value=doc_ids, - ), - ], - ) - - # No indexed content for these documents -> bail early. 
- if not index.vector_store.get_nodes(filters=filters): - logger.warning("No nodes found for the given documents.") - yield CHAT_NO_CONTENT_MESSAGE - return - - retriever = VectorIndexRetriever( - index=index, - similarity_top_k=CHAT_RETRIEVER_TOP_K, - filters=filters, - ) - - top_nodes = retriever.retrieve(query_str) - if len(top_nodes) == 0: - logger.warning("Retriever returned no nodes for the given documents.") - yield CHAT_NO_CONTENT_MESSAGE - return - - references = _get_document_references(documents, top_nodes) - - prompt_template = PromptTemplate(template=CHAT_PROMPT_TMPL) - response_synthesizer = get_response_synthesizer( - llm=client.llm, - prompt_helper=get_rag_prompt_helper(), - text_qa_template=prompt_template, - streaming=True, - ) - query_engine = RetrieverQueryEngine.from_args( - retriever=retriever, - llm=client.llm, - response_synthesizer=response_synthesizer, - streaming=True, - ) - - logger.debug("Document chat query: %s", query_str) - response_stream = query_engine.query(query_str) - for chunk in response_stream.response_gen: - yield chunk - sys.stdout.flush() - - if references: - yield _format_chat_metadata_trailer(references) -``` - -- [ ] **Step 6: Run to verify it passes** - -Run: `uv run pytest src/paperless_ai/tests/test_chat.py -v` -Expected: PASS (update or remove any remaining test that asserted on `DocumentFilteredFaissRetriever` / FAISS internals). 
- -- [ ] **Step 7: Lint and commit** - -```bash -ruff check src/paperless_ai/chat.py -ruff format src/paperless_ai/chat.py -git add src/paperless_ai/chat.py src/paperless_ai/tests/test_chat.py src/paperless_ai/tests/conftest.py src/paperless_ai/tests/test_ai_indexing.py -git commit -m "refactor(ai): chat uses a stock filtered retriever - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 11: Sweep remaining FAISS references and fix stale tests - -**Files:** - -- Modify: `src/paperless_ai/tests/test_ai_indexing.py`, `src/paperless_ai/tests/test_chat.py` (delete/adjust FAISS-internal assertions) -- Modify: any remaining references in `src/paperless_ai/` to `FaissVectorStore`, `SimpleDocumentStore`, `SimpleIndexStore`, `get_or_create_storage_context`, `remove_document_docstore_nodes`. - -- [ ] **Step 1: Find remaining references** - -Run: `git grep -n "Faiss\|FaissVectorStore\|SimpleDocumentStore\|SimpleIndexStore\|get_or_create_storage_context\|remove_document_docstore_nodes\|_faiss_index\|index_struct.nodes_dict\|docstore.docs" src/` -Expected: matches only in tests to be updated, or none. - -- [ ] **Step 2: Update or delete each stale reference** - -For each match in a test, replace the docstore/FAISS-internal assertion with the equivalent store-level assertion (`store.client.open_table(...).count_rows()`, `store.query(...)`, `store.get_nodes(...)`). Delete tests that only validated old internals (e.g. a test asserting `remove_document_docstore_nodes` left FAISS vectors behind). - -- [ ] **Step 3: Run the whole AI suite** - -Run: `uv run pytest src/paperless_ai/ -v` -Expected: PASS, no references to removed symbols. - -- [ ] **Step 4: Run the documents task tests that touch the LLM index** - -Run: `uv run pytest src/documents/tests/test_tasks.py -k "llm or index" -v` -Expected: PASS. 
- -- [ ] **Step 5: Commit** - -```bash -git add src/paperless_ai/tests/ -git commit -m "test(ai): drop FAISS-internal assertions - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 12: Full-suite verification and config docs - -**Files:** - -- Modify: `paperless.conf.example` (document the unchanged `PAPERLESS_LLM_INDEX_TASK_CRON`; no new vars required — ANN threshold and compaction retention are internal constants). - -- [ ] **Step 1: Confirm no new env vars needed** - -The ANN threshold (`ANN_INDEX_MIN_ROWS`) and compaction retention (`get_llm_index_compaction_retention`) are internal constants per the spec. No `paperless.conf.example` change is required unless a maintainer wants them tunable. Skip unless requested. - -- [ ] **Step 2: Run the full backend AI + tasks suite** - -Run: `uv run pytest src/paperless_ai/ src/documents/tests/test_tasks.py -v` -Expected: PASS. - -- [ ] **Step 3: Run the app-config API test (it referenced the index status)** - -Run: `uv run pytest src/documents/tests/test_api_app_config.py -v` -Expected: PASS. - -- [ ] **Step 4: Lint the whole AI package** - -Run: `ruff check src/paperless_ai && ruff format --check src/paperless_ai` -Expected: clean. - -- [ ] **Step 5: Verify the consume → index path manually (smoke)** - -Run a quick smoke per the spec: with `PAPERLESS_LLM_INDEX_ENABLED` on, consume a document and confirm `llm_index_add_or_update_document` writes a row (the test in Task 7 covers this in CI; this is an optional manual smoke). - -- [ ] **Step 6: Final commit / branch is ready for PR** - -```bash -git add -A -git commit -m "chore(ai): finalize LanceDB vector store migration - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Task 13: Type-check the new/changed AI code against pyrefly - -**Goal:** the code this branch adds/changes passes `pyrefly` cleanly **without growing -`.pyrefly-baseline.json`**. 
The baseline (~600 KB) suppresses pre-existing repo errors; our -new code must not add to it. Run this **last**, once all implementation tasks are done, so -every new file/symbol exists. - -**Files (likely to need annotations/fixes):** - -- `src/paperless_ai/vector_store.py`, `src/paperless_ai/indexing.py`, - `src/paperless_ai/chat.py`, `src/paperless_ai/embedding.py`, and the new test modules. - -**Environment:** pyrefly needs the dependencies installed to resolve third-party types, so -run it **on the Linux VM** (where the venv has `lancedb`/`pyarrow`/`llama_index`). The -`[tool.pyrefly]` config already sets `search-path = ["src"]`, `python-platform = "linux"`, -and `baseline = ".pyrefly-baseline.json"`, so `pyrefly check` from the repo root applies -the baseline automatically and reports only non-baselined (i.e. new) errors. - -- [ ] **Step 1: Run pyrefly on the VM and capture NEW errors** - -```bash -tar czf - src pyproject.toml uv.lock .pyrefly-baseline.json \ - | ssh -o BatchMode=yes -p 2244 trenton@localhost 'tar xzf - -C ~/projects/paperless-ngx' -ssh -o BatchMode=yes -p 2244 trenton@localhost \ - 'bash -lc "cd ~/projects/paperless-ngx && uv run pyrefly check"' -``` - -Expected at first: a list of errors located in the changed `paperless_ai` files (anything -already in the baseline is suppressed). Note each `file:line` + error code. - -- [ ] **Step 2: Fix the type errors at the source** - -Prefer real fixes over suppressions: - -- Add/repair annotations on our functions, fixtures, and the adapter methods so signatures - match `BasePydanticVectorStore` (e.g. `Sequence[BaseNode]`, `list[str]`, the - `MetadataFilters | None` params, `VectorStoreQueryResult` return). -- Annotate `PrivateAttr` fields and the lazy `get_vector_store() -> "PaperlessLanceVectorStore"` - (string annotation under `TYPE_CHECKING`). 
-- For genuine third-party stub gaps (`lancedb`/`pyarrow` ship little/no type info; some - `llama_index` returns are dynamic), use a **targeted, commented** suppression on that exact - line — `# type: ignore[] # lancedb has no type stubs` — not a blanket file-level - ignore. - -- [ ] **Step 3: Do NOT grow the baseline** - -Do not regenerate or append `.pyrefly-baseline.json`. The goal is zero new baseline entries. -If — and only if — an error is genuinely impossible to fix or suppress inline (rare), stop -and report it as DONE_WITH_CONCERNS describing the specific error, rather than silently -baselining it. - -- [ ] **Step 4: Re-run until clean** - -Re-run the Step 1 command. Expected: no errors in the `paperless_ai` files we touched (the -overall run still passes via the unchanged baseline for the rest of the repo). - -- [ ] **Step 5: Lint and commit** - -```bash -ruff check src/paperless_ai -ruff format src/paperless_ai -git add src/paperless_ai -git commit -m "types(ai): pass pyrefly for the LanceDB vector store code - -Co-Authored-By: Claude Opus 4.8 (1M context) " -``` - ---- - -## Self-Review notes (for the implementer) - -- **Lazy imports are a hard requirement** (see the constraint section). After Tasks 6, 7, and 10, the guard test (`test_lazy_imports.py`) must stay green: importing `documents.tasks` must not load `lancedb` / `pyarrow` / `llama_index`. Every `llama_index` symbol in `indexing.py`/`chat.py` (retrievers, filters, `MetadataMode`) and the `vector_store` import itself must be function-local; only `vector_store.py` and test modules import these at top level. - -- **`MetadataMode.EMBED`** is passed to `get_content` when embedding in the add/incremental/rebuild paths. Because `build_document_node` sets `excluded_embed_metadata_keys` to every metadata key, `EMBED` yields just the chunk text — exactly what llama-index's own embedding pipeline would feed the model, preserving current behavior. 
The import `from llama_index.core.schema import MetadataMode` is added in Task 7. -- **`list_size`** is the pyarrow attribute for a fixed-size list's length, used by `vector_dim()` (Task 2) and `maybe_create_ann_index()` (Task 4). Confirm on the installed pyarrow (`pa.list_(pa.float32(), 8).list_size`); adjust the accessor in both places if the version differs. -- **`merge_insert` match key `id` must never get a scalar index** (LanceDB #3177). The only scalar index is on `document_id` (`ensure_document_id_scalar_index`). Task 4's `test_upsert_after_optimize_with_scalar_index` guards this. -- **`embed_model.get_text_embedding`** is called per node in the rebuild/incremental/add paths because we bypass `index.insert_nodes` and write to the store directly. This matches the proven probe. For large rebuilds, consider batching with `get_text_embedding_batch` as a later optimization (YAGNI for now). -- **Compaction retention** defaults to 1 hour (`get_llm_index_compaction_retention`); tests call `compact(retention_seconds=0)` directly to force a single version. 
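The batching idea deferred in the note on `embed_model.get_text_embedding` could look roughly like the sketch below. It is an illustration only, not part of the plan: `batched`, `embed_in_batches`, and `fake_embed_batch` are hypothetical names, and the fake embedder stands in for a batch method such as `get_text_embedding_batch`.

```python
from collections.abc import Callable, Iterator, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start : start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    batch_size: int = 32,
) -> list[list[float]]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[list[float]] = []
    for chunk in batched(texts, batch_size):
        vectors.extend(embed_batch(list(chunk)))
    return vectors


def fake_embed_batch(texts: list[str]) -> list[list[float]]:
    """Stand-in embedder: one-dimensional vector holding the text length."""
    return [[float(len(text))] for text in texts]


vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed_batch, batch_size=2)
```

Order preservation is the property to check in any real version, because each vector is assigned back to the node at the same position.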
diff --git a/docs/superpowers/specs/2026-06-02-lancedb-vector-store-design.md b/docs/superpowers/specs/2026-06-02-lancedb-vector-store-design.md deleted file mode 100644 index 722709a59..000000000 --- a/docs/superpowers/specs/2026-06-02-lancedb-vector-store-design.md +++ /dev/null @@ -1,448 +0,0 @@ -# Replace the FAISS vector store with LanceDB - -**Date:** 2026-06-02 -**Status:** Design — pending implementation plan -**Area:** `src/paperless_ai/` (AI / LLM index feature) - -## Problem - -The optional AI feature stores document embeddings in a llama-index `StorageContext` -made of three file-backed components persisted under `DATA_DIR/llm_index/`: - -| Component | Role | Backing | -| ---------------------------------------- | ---------------------------------------------------- | ----------------- | -| `FaissVectorStore` (`faiss.IndexFlatL2`) | the vectors | binary faiss file | -| `SimpleDocumentStore` | node text + metadata (source of truth for retrieval) | one large JSON | -| `SimpleIndexStore` | `vector_id → node_id` map | JSON | - -`faiss.IndexFlatL2` is append-only and has no metadata filtering, and all three -components are whole-file, load-everything-into-RAM structures. That combination — -not FAISS alone — drives the bulk of the surrounding complexity and is what fails -on large installs: - -1. **Deletes are fake.** On update/remove, `remove_document_docstore_nodes` - (`indexing.py:182`) deletes nodes from the _docstore_ only; the FAISS vectors - physically remain forever. The only way to truly reclaim them is a full - `rebuild=True` (re-embed every document). -2. **No metadata filtering** forces the entire custom `DocumentFilteredFaissRetriever` - (`chat.py:78-151`) with its expanding `top_k *= 2` loop to emulate a - `document_id IN (...)` filter. -3. **Whole-docstore Python scans.** `query_similar_documents` (`indexing.py:419`) - iterates the full docstore in Python to translate `document_id → node_id`. -4. 
**Write amplification.** Every single-document add/update/remove takes a global - `FileLock` and calls `storage_context.persist()`, which rewrites the entire - multi-GB JSON docstore — O(N) memory and O(N) disk per document operation. -5. **Brute-force query.** `IndexFlatL2` is O(N·d) per search with no ANN. - -We cannot predict or bound a user's install size, so the replacement must scale from -a handful of documents to very large corpora on a single node, with no extra service. - -## Constraints (decided during brainstorming) - -- **Engine-agnostic, on-disk store.** Paperless supports SQLite, PostgreSQL _and_ - MariaDB, so DB-integrated vectors (e.g. pgvector) are out — the vector store stays - a self-contained on-disk artifact like today's `llm_index` dir, identical across DB - backends. -- **Swap the storage layer only.** Keep llama-index as the framework. `VectorStoreIndex`, - the retrievers, the chat query engine + response synthesizer, `SimpleNodeParser`, and - the embedding-model abstraction are all unchanged. Only the `StorageContext` trio is - replaced. -- **Store: LanceDB**, integrated via a **custom `BasePydanticVectorStore` adapter** we - own (`PaperlessLanceVectorStore`) talking to `lancedb` + `pyarrow` directly — _not_ the - official `llama-index-vector-stores-lancedb` wrapper. The wrapper was evaluated and - rejected: it hard-requires `pandas`, hides `index_type` behind `**kwargs`, and _raises_ - on empty query results. A ~150-180 line adapter against llama-index's stable public - interfaces avoids all three and lets us own the table schema. (See "Why a custom - adapter".) -- **ANN: auto threshold.** Small installs use LanceDB's exact (brute-force) kNN, which - LanceDB's own docs call sufficient for datasets up to ~100K vectors. Past a threshold - we build an IVF index automatically, best-effort, with exact search as the - always-valid fallback. 
-- **pandas is eliminated.** `llama-index-core` does not depend on pandas, and the custom - adapter materializes LanceDB results via `pyarrow` (`.to_list()`), so pandas never - enters the dependency tree. `pyarrow` is a direct dep but arrives transitively through - `lancedb` regardless. - -## Why LanceDB - -LanceDB is the only embedded, serverless candidate architected for **disk-resident, -memory-mapped** operation — RAM does not scale with the corpus, which is the single -most important property for "tiny or very large, equally." It provides real CRUD -(predicate `delete`, `add`), filtered search, and IVF ANN, all writing to a directory on -disk. Because our adapter declares `stores_text = True`, llama-index runs off the vector -store alone — so both `SimpleDocumentStore` and `SimpleIndexStore` are deleted outright. - -Verified against `lancedb 0.33.0` with functional probes: - -- A `lancedb` table on disk is memory-mapped; writes are durable on call (connect = a - directory, table = a Lance dataset). **No `persist()` and no whole-file rewrite.** -- `table.delete('doc_id = "..."')` is a real predicate delete that physically removes - rows (probe: a 2-chunk doc dropped to 0 rows). -- `table.add(rows)` appends; `merge_insert(...).when_not_matched_by_source_delete(...)` - provides an atomic upsert that also prunes stale chunks — the incremental update path - (see §3). Verified: a doc going 5→3 chunks ends with exactly the 3 new chunks. -- `table.search(embedding).where('document_id IN (...)').limit(k).to_list()` returns - plain dicts via `pyarrow` (**no pandas**), and returns `[]` cleanly on no match — **no - raise**. 
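The upsert-with-prune behavior verified above can be modelled without LanceDB. The following is a minimal sketch in plain Python, not LanceDB's implementation: `merge_insert_model` and its parameters are hypothetical names, and a list of dicts stands in for the table. It only illustrates what the three `merge_insert` clauses named in the text are expected to do when a document shrinks from 5 chunks to 3.

```python
from typing import Callable

Row = dict


def merge_insert_model(
    table: list[Row],
    new_rows: list[Row],
    on: str,
    delete_unmatched_if: Callable[[Row], bool],
) -> list[Row]:
    """Pure-Python model of an upsert that also prunes stale rows.

    Hypothetical helper for illustration only; it mimics the three
    merge_insert clauses on a list of dicts instead of a Lance table.
    """
    incoming = {row[on]: row for row in new_rows}
    result: list[Row] = []
    for row in table:
        key = row[on]
        if key in incoming:
            # when_matched_update_all: the new row replaces the old one.
            result.append(incoming.pop(key))
        elif delete_unmatched_if(row):
            # when_not_matched_by_source_delete: a stale row in scope is dropped.
            continue
        else:
            # Rows outside the delete predicate (other documents) are kept.
            result.append(row)
    # when_not_matched_insert_all: the remaining new rows are appended.
    result.extend(incoming.values())
    return result


# Document "1" shrinks from 5 chunks to 3; document "2" must stay untouched.
table = [{"id": f"old-{i}", "document_id": "1"} for i in range(5)]
table.append({"id": "other-0", "document_id": "2"})
new_rows = [{"id": f"new-{i}", "document_id": "1"} for i in range(3)]

after = merge_insert_model(
    table,
    new_rows,
    on="id",
    delete_unmatched_if=lambda row: row["document_id"] == "1",
)
remaining_ids = [row["id"] for row in after]
```

Because none of the new chunk ids match the old ones, the model degenerates to a full replace for document "1", which is the non-deterministic-id case the design relies on.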
- -## Why a custom adapter (not `llama-index-vector-stores-lancedb`) - -The custom adapter was proven end-to-end through llama-index's real -`VectorStoreIndex` → `VectorIndexRetriever` path with a `MockEmbedding`: build, update -(delete+insert with **zero orphan rows**), `MetadataFilters(IN)` forwarded through the -retriever, empty-filter → `[]`, and remove — all with **`pandas` never imported**. The -adapter is ~120 lines in the probe (≈150-180 production-ready) and uses only llama-index's -**stable public** primitives: `BasePydanticVectorStore`, `node_to_metadata_dict` / -`metadata_dict_to_node`, `VectorStoreQuery` / `VectorStoreQueryResult`. - -Choosing the adapter over the wrapper converts several wrapper-specific liabilities into -non-issues: - -| Wrapper liability | With the custom adapter | -| --------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- | -| Hard-imports `pandas` (`base.py:33`), uses `.to_pandas()` | Eliminated — `pyarrow.to_list()` | -| `create_index` hides `index_type`, hard-defaults `num_sub_vectors=96` (base.py:333-368) | We call `table.create_index(...)` with explicit `index_type` / partitions / sub-vectors | -| `query()` _raises_ `Warning` on empty results (base.py:560-563) | Our `query()` returns an empty `VectorStoreQueryResult` natively | -| `_to_lance_filter` prefixes `metadata.`; fragile when `_metadata_keys is None` | Dedicated top-level `document_id` column; filter is plain `document_id IN (...)`, scalar-indexable | -| Third-party package to pin and track for API drift | No integration package; depend only on stable llama-index core interfaces | - -The cost is ~150-180 lines we own and test (vs. a ~10-line subclass) — but we were -already subclassing to swallow the empty-result `Warning` and add the ANN threshold, so -the net additional code is modest and removes a dependency. - -## Design - -### 1. 
Storage layer - -Replace `get_or_create_storage_context()` with a vector-store factory that returns a -`PaperlessLanceVectorStore` pointed at `settings.LLM_INDEX_DIR` with an **explicit, -pinned `table_name`** (e.g. `LLM_INDEX_TABLE = "documents"`) used consistently by the -factory, the existence check (§7), and the migration detection (§8). The index is built -with `VectorStoreIndex.from_vector_store(vector_store, embed_model=...)` for the -load/query path, and `VectorStoreIndex(nodes=..., storage_context=...)` (storage context -holding only the vector store) for the rebuild path. No docstore, no index store. - -`meta.json` (embedding model name + dimension) is **kept** for embedding-model-change -detection that forces a rebuild — unchanged from today (`embedding.py:get_embedding_dim`). - -### 2. `PaperlessLanceVectorStore(BasePydanticVectorStore)` - -A custom adapter (~150-180 lines) implementing llama-index's vector-store contract -directly against `lancedb` + `pyarrow`. Class flags: `stores_text = True`, -`flat_metadata = True`. - -**Table schema** (explicit `pyarrow` schema, created lazily on first `add`): - -| Column | Type | Purpose | -| -------------- | ------------------------------- | ----------------------------------------------------------------- | -| `id` | `string` | node id (`node.node_id`) | -| `doc_id` | `string` | `node.ref_doc_id` (= `str(document.id)`, see §3) — the delete key | -| `document_id` | `string` | top-level filter column (mirrors `metadata["document_id"]`) | -| `vector` | `fixed_size_list[dim]` | embedding | -| `node_content` | `string` | `json.dumps(node_to_metadata_dict(node, remove_text=False))` | - -A dedicated top-level `document_id` column (rather than the wrapper's nested -`metadata.` struct) makes filtering a plain `document_id IN (...)` predicate and -allows an optional LanceDB **scalar index** on it for fast filtered scans. 
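To make the schema above concrete, here is a minimal sketch of how one chunk could be turned into a table row. It is an illustration under stated assumptions: `node_to_row` is a hypothetical helper, and a plain dict stands in for the output of llama-index's `node_to_metadata_dict`, which the real adapter would call instead.

```python
import json


def node_to_row(
    node_id: str,
    ref_doc_id: str,
    text: str,
    metadata: dict,
    embedding: list,
) -> dict:
    """Build one row matching the five columns of the table schema.

    Hypothetical helper: `node_content` below is a simplified stand-in for
    json.dumps(node_to_metadata_dict(node, remove_text=False)).
    """
    node_content = {"id_": node_id, "text": text, "metadata": metadata}
    return {
        "id": node_id,
        "doc_id": ref_doc_id,
        # Mirror the metadata value into the top-level filter column as a string.
        "document_id": str(metadata["document_id"]),
        "vector": [float(x) for x in embedding],
        "node_content": json.dumps(node_content),
    }


row = node_to_row(
    node_id="chunk-0",
    ref_doc_id="42",
    text="Invoice total: 10 EUR",
    metadata={"document_id": 42, "title": "Invoice"},
    embedding=[0, 1, 0.5],
)
```

Keeping `document_id` as its own string column is what allows the plain `document_id IN (...)` predicate, while the full node still round-trips through `node_content`.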
- -**Methods:** - -- `add(nodes)`: serialize each node via - `node_to_metadata_dict(node, remove_text=False, flat_metadata=True)` into the schema - above; lazily `create_table` (with the explicit schema sized to the embedding dim) or - `table.add(rows)` (plain append). Used by the **rebuild** path (bulk insert into a fresh - table) and as llama-index's `add` hook. Returns node ids. -- `upsert_document(document_id, nodes)`: the **incremental** add/update path. A single - `merge_insert("id").when_matched_update_all().when_not_matched_insert_all().when_not_matched_by_source_delete(f"document_id = '{document_id}'").execute(rows)` - call is an atomic replace-with-prune for one document (see §3). All nodes passed must - belong to the one `document_id`. Nodes are embedded before the call (the incremental path - embeds with the configured `embed_model` rather than going through `index.insert_nodes`). -- `delete(ref_doc_id)`: `table.delete(f'doc_id = "{ref_doc_id}"')` (parameter-escaped). - Used for document removal. -- `delete_nodes(node_ids)`: `table.delete('id IN (...)')` (for completeness). -- `get_nodes(node_ids=None, filters=None)`: `table.search().where(...).to_list()`, then - rebuild nodes via `metadata_dict_to_node(json.loads(row["node_content"]))`. Returns `[]` - cleanly when empty, which makes it the correct primitive for the chat no-content pre-check. -- `query(VectorStoreQuery)`: - `table.search(query.query_embedding).where(_build_where(query.filters)).limit(query.similarity_top_k).to_list()`; - rebuild nodes, map LanceDB L2 `_distance` → a similarity score, and **return an empty - `VectorStoreQueryResult` on no match (no raise)**. -- `client` property → the `lancedb` connection. - -**Filter translation**: `_build_where(MetadataFilters)` handles exactly the operators we -use (`EQ`, `IN`) on the top-level `document_id` column, string-escaping values. This is -small, fully owned, and free of the wrapper's `metadata.`-prefix / `_metadata_keys` -behavior.
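The filter translation described above can be sketched in plain Python. This is a minimal illustration, not the adapter's code: `Filter` and `Op` are hypothetical stand-ins for llama-index's `MetadataFilter` and `FilterOperator`, and quoting string literals with single quotes (doubling embedded quotes) and the `1 = 0` predicate for an empty `IN` list are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Op(Enum):
    """Hypothetical stand-in for the two operators the design uses."""

    EQ = "eq"
    IN = "in"


@dataclass(frozen=True)
class Filter:
    """Hypothetical stand-in for a single metadata filter."""

    key: str
    operator: Op
    value: Any


def _quote(value: Any) -> str:
    # Assumption: SQL-style string literal, embedded single quotes doubled.
    return "'" + str(value).replace("'", "''") + "'"


def build_where(filters: list) -> Optional[str]:
    """Translate filters on `document_id` into a predicate string.

    Returns None when there are no filters (an unconstrained query).
    """
    if not filters:
        return None
    clauses = []
    for f in filters:
        if f.key != "document_id":
            raise ValueError(f"unsupported filter key: {f.key}")
        if f.operator is Op.EQ:
            clauses.append(f"document_id = {_quote(f.value)}")
        elif f.operator is Op.IN:
            values = list(f.value)
            if not values:
                # Assumption: an empty IN list should match nothing.
                clauses.append("1 = 0")
            else:
                quoted = ", ".join(_quote(v) for v in values)
                clauses.append(f"document_id IN ({quoted})")
        else:
            raise ValueError(f"unsupported operator: {f.operator}")
    return " AND ".join(clauses)


where = build_where([Filter("document_id", Op.IN, ["1", "2"])])
```

Rejecting unknown keys and operators keeps the translator as small as the text intends: anything outside `EQ` / `IN` on `document_id` fails loudly instead of producing a silently wrong predicate.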
- -**Auto ANN index**: `maybe_create_ann_index()` is called after build/update writes **while -holding the global write lock** (it is itself a write path). If the table row count -exceeds `ANN_INDEX_MIN_ROWS` (~100K chunks, per LanceDB guidance) and no vector index -exists yet, it makes a best-effort `table.create_index(...)` call: - -- **Index type by divisibility.** IVF_PQ requires `num_sub_vectors` to _evenly divide_ - the embedding dimension; LanceDB raises a hard `RuntimeError` otherwise (verified). The - dimension is detected at runtime from a user-configurable model, and many common dims - (e.g. 1024) are **not** divisible by 96, the wrapper's hard default. So: pick a - `num_sub_vectors` that divides the dim and build **IVF_PQ**; if none exists, build - **IVF_FLAT** (`index_type="IVF_FLAT"`), which has no divisor constraint and still gives - IVF/ANN speedup, strictly better than reverting to full brute-force. (Talking to `lancedb` - directly, `index_type` is just a named argument, with none of the wrapper's - kwargs-smuggling.) -- `num_partitions`: LanceDB guidance is ≈ `num_rows / 4096`; clamp to a sane minimum. -- Wrapped in `try/except`: a failure logs and leaves the table on exact search, which is - always correct. - -### 3. Node identity - -In `build_document_node` (`indexing.py:109`), set the `LlamaDocument` `id_` to -`str(document.id)`. `SimpleNodeParser` propagates that as each chunk node's -`ref_doc_id`, and the adapter stores it in the `doc_id` column. Result: every chunk of a -paperless document shares `ref_doc_id == str(document.id)`, so one `delete(str(doc.id))` -clears exactly that document's chunks (verified end-to-end). `document_id` also remains in -node metadata (and is mirrored to the top-level filter column) for filtering and result -mapping.
- -**Update = native upsert via `merge_insert` (one atomic commit).** The incremental -add/update path uses a single `merge_insert`, not delete-then-add: - -``` -table.merge_insert("id") - .when_matched_update_all() - .when_not_matched_insert_all() - .when_not_matched_by_source_delete(f"document_id = '{document_id}'") - .execute(new_rows) -``` - -The `when_not_matched_by_source_delete` clause — scoped to the document's `document_id` -— prunes stale trailing chunks (the case where an edit reduces a document's chunk count) -**atomically in the same commit**. Verified on 0.33.0: a doc going 5→3 chunks ends with -exactly the 3 new chunks, other documents untouched, and it works whether or not chunk -ids are deterministic (non-matching ids become a full replace). - -This is strictly better than delete-then-add on three axes: - -- **Atomicity / no transient empty state.** Queries take no lock (§6), so delete-then-add - exposes a window between the delete commit and the add commit in which a concurrent - reader sees the document with **zero chunks**. A single `merge_insert` commit eliminates - that window — a reader sees either the old or the new chunk set. -- **Half the version growth.** One commit per update instead of two, directly halving the - MVCC version accumulation that compaction (§10) must reclaim. -- **Correctness preserved** without a separate delete call. - -> **Important:** `optimize()` prunes old _versions_, **not** dead _rows_ in the live -> version. A plain upsert (update+insert without the delete clause) would leave stale -> chunks as live rows that `optimize` can never remove — so the -> `when_not_matched_by_source_delete` clause is mandatory, not optional. - -> **Index caveat (LanceDB #3177):** `merge_insert` can fail _silently_ after `optimize()` -> when a scalar index exists on the **match column**. We match on `id`, so a scalar index -> must **never** be created on `id`. 
The optional scalar index for filtering goes on -> `document_id` only (§2), which is not the match column. - -### 4. The four operations collapse - -| Operation | Before | After | -| ------------ | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- | -| add / update | load whole index → `remove_document_docstore_nodes` (fake delete) → `insert_nodes` → `persist` (rewrite GB) | `store.upsert_document(str(doc.id), embedded_nodes)` (one atomic `merge_insert` commit) | -| remove | load → docstore delete → persist | `store.delete(str(doc.id))` | -| similar | load whole docstore, Python scan for node ids, custom retriever | `VectorIndexRetriever(index, similarity_top_k=k, filters=document_id IN allowed)` | -| chat | custom `DocumentFilteredFaissRetriever` (74 lines) | stock `VectorIndexRetriever(filters=document_id IN doc_ids)` | - -Deleted entirely: `remove_document_docstore_nodes`, the whole-docstore scan in -`query_similar_documents`, and `_get_document_filtered_retriever` / -`DocumentFilteredFaissRetriever` in `chat.py`. `chat.py`'s direct reaches into -`index.docstore.docs`, `index.vector_store._faiss_index`, and -`index.index_struct.nodes_dict` all disappear. The "no content" pre-check in -`_stream_chat_with_documents` becomes a `store.get_nodes(filters=...)` existence check — -the adapter's `get_nodes` returns `[]` cleanly on no match, so it is the correct -primitive for an existence test. References are still derived from returned nodes' -`metadata["document_id"]` / `metadata["title"]`, so `_get_document_references` is -unchanged. - -Because the adapter's `query()` returns an empty result on no match (it never raises), -both similar-docs and chat — which retrieve through `VectorIndexRetriever` / -`RetrieverQueryEngine` calling `vector_store.query()` internally — get a clean empty -result set instead of an exception. 
This was the wrapper's most disruptive wart and is -designed out, not worked around. - -### 5. Filtering - -Both similar-document and chat retrieval pass a `MetadataFilters` with a single -`MetadataFilter(key="document_id", operator=FilterOperator.IN, value=[...])` (omitted -when unconstrained). The adapter's `_build_where` translates this to the plain predicate -`document_id IN ("...","...")` against the top-level `document_id` column — no struct -path, no `_metadata_keys` dependence, so filtering is unconditionally correct on a freshly -opened table and across process restarts (proven by the fresh-process probe). - -This replaces today's `query_similar_documents` mechanism, which pre-scans the docstore -for node ids and passes `doc_ids=` to `VectorIndexRetriever` (`indexing.py:416-434`) — a -_different_ retriever mechanism. The new path relies on -`VectorIndexRetriever(filters=MetadataFilters(...))` forwarding `query.filters` into -`vector_store.query()` — **verified end-to-end in the probe** (a retriever with an `IN` -filter returned only the matching document). Still covered by a regression test (see -Testing) since it is load-bearing for both similar-docs and chat. - -### 6. Concurrency - -Keep the existing `FileLock(_index_lock_path())` around **writes** only. Each write is now -a small delta append/upsert instead of a multi-GB rewrite, so the lock is held briefly. -Queries take no lock (LanceDB reads are MVCC snapshot-consistent). The lock-free read path -is safe for updates specifically because the incremental update is a single atomic -`merge_insert` commit (§3) — a reader never observes a document mid-update. The lock still -serializes _writers_ across Celery processes to avoid `CommitConflictError`. - -**Why the global lock is load-bearing.** LanceDB's MVCC tolerates concurrent _appends_, -but concurrent _delete/update_ operations frequently conflict and fail with -`CommitConflictError` after exhausting retries (LanceDB issues #1597, #3086). 
Paperless's -add/update path is exactly that kind of operation (the §3 `merge_insert` updates and deletes rows) and runs from **separate Celery worker -processes**. The design is safe only because `_index_lock_path()` is a single shared lock -file under `LLM_INDEX_DIR` that serializes _all_ writers. This lock must: - -- remain a single global lock (do **not** relax to per-document granularity), and -- cover every write path: add, update, remove, **and** `maybe_create_ann_index()`. - -### 7. Index existence / rebuild trigger - -Replace `vector_store_file_exists()` with a check for the LanceDB table's existence -(`LLM_INDEX_DIR` present and the pinned `LLM_INDEX_TABLE` in `connection.table_names()`). -The existing `queue_llm_index_update_if_needed` / `load_or_build_index` rebuild-on-missing -logic is otherwise unchanged. - -**Dimension-mismatch guard.** The Lance table's vector column dimension is fixed at the -first `add()`. Beyond the `meta.json` model-change detection (which forces a rebuild when -the _model name_ changes), guard against a dimension mismatch directly: if the current -embedding dim differs from the existing table's vector dim, force a rebuild rather than -letting `add()` fail with a hard dimension error. This covers the gaps `meta.json` can't: -a missing/corrupt `meta.json`, or two models sharing a name but differing in dim. - -### 8. Migration - -The index is fully derived data, rebuildable from `Document` rows. On first run of the -new code, detect the stale FAISS format (presence of `default__vector_store.json` / -faiss files with no LanceDB table), wipe `LLM_INDEX_DIR`, and trigger a rebuild through -the existing `queue_llm_index_update_if_needed(rebuild=...)` path. This requires no data -migration and no user action beyond the automatic background rebuild. - -### 9. Dependencies (`pyproject.toml`) - -- **Remove:** `faiss-cpu`, `llama-index-vector-stores-faiss`.
-- **Add:** `lancedb` (pulls in `pyarrow`, `numpy`, `pydantic`, `tqdm`) and `pyarrow` - (declared directly since the adapter imports it, even though `lancedb` pulls it - transitively). **No `llama-index-vector-stores-lancedb`, no `pandas`** — `llama-index-core` - does not require pandas (verified) and the adapter uses `pyarrow.to_list()`. -- Confirm multi-arch wheels (linux x86_64 + aarch64, the paperless Docker targets) for - `lancedb`/`pyarrow` resolve in the lockfile. (`lancedb 0.33.0` ships manylinux x86_64 + - aarch64 wheels, matching the paperless Docker build matrix.) - -### 10. Maintenance / compaction — **required, not optional** - -MVCC has a real disk cost that this design must actively manage. LanceDB writes a **new -fragment + version on every `add`/`delete`** and retains the superseded files until -cleanup. Paperless adds/updates documents **one at a time**, so the store bloats -continuously without maintenance. Measured on 2000 × 768-dim vectors (raw float32 = -6000 KiB): - -| Scenario | On disk | Versions | -| ------------------------------------------------------- | ---------------------- | -------- | -| One bulk insert (= a rebuild) | 6016 KiB | 1 | -| 2000 single-row adds (= per-document writes) | **172,848 KiB (~28×)** | 2001 | -| After `table.optimize(cleanup_older_than=timedelta(0))` | **6344 KiB** | 1 | - -Implications: - -- **Full rebuilds are naturally compact** (bulk insert ≈ raw vector bytes), so a rebuild - resets accumulated bloat. -- **The atomic upsert (§3) halves _update_ version growth** (one commit instead of - delete-then-add's two), but every new-document insert is still its own version, so - versions accumulate over time regardless — compaction remains required. -- **Per-document writes must be compacted periodically.** Run - `table.optimize(cleanup_older_than=)` — a **single call** that compacts - fragments _and_ drops old versions — folded into the existing scheduled LLM-index - maintenance task, under the global write lock. 
Use a small but non-zero retention in - production (e.g. minutes to hours) so an old version isn't deleted out from under an - in-flight reader; `timedelta(0)` is for tests/rebuild-time only. -- **Do not use the older `cleanup_old_versions()`** API: it requires the separate - `pylance` package (not pulled by `lancedb` core) and is superseded by - `optimize(cleanup_older_than=...)`. - -**On the "larger on disk than FAISS" observation:** at small scale LanceDB stores vectors -as **raw `float32`** (identical per-vector bytes to FAISS `IndexFlatL2`); vector -_compression_ only comes from the IVF_PQ index, which only exists past the ANN threshold -(§2). So a small dataset is expected to be _comparable to, not smaller than_, FAISS, and any -large discrepancy is version accumulation, fixed by the compaction above. - -> **Windows note:** the probe hit an `Access is denied` error writing a version-hint file -> during cleanup on Windows (temp-dir file locking). Paperless production is Linux -> containers, so this does not affect the deployment target, but bare-metal Windows dev -> installs may need attention. - -## Testing - -Tests follow project conventions (pytest-style, classes with `@pytest.mark.django_db`, -pytest-mock, factory-boy, type-annotated fixtures/tests, default config). LanceDB writes -to a real directory, so tests point `settings.LLM_INDEX_DIR` at `tmp_path` and exercise a -**real** (tiny) LanceDB table with a stub embedding model returning deterministic vectors, -with no mocking of store internals. - -- **add → query** returns the document. -- **update** via `upsert_document` leaves no orphan rows: re-index a document whose chunk - count _shrinks_ (e.g. 5→3) and assert exactly the new chunks remain and other documents - are untouched (this is the guarantee the old fake delete could not provide, and it proves - `when_not_matched_by_source_delete` prunes stale chunks).
-- **update is one commit** — assert the table version advances by exactly 1 per - `upsert_document` (guards the atomicity / version-growth property). -- **remove** drops all of a document's chunks. -- **filtered query** scopes results to the given `document_id`s and excludes others. -- **empty query** returns `[]` (the adapter's `query()` never raises). -- **node round-trip**: a node serialized via `node_to_metadata_dict` and reconstructed via - `metadata_dict_to_node` preserves text + metadata (`document_id`, `title`). -- **embedding-model change** → `meta.json` mismatch forces rebuild (existing behavior). -- **dimension-mismatch guard** → a current embedding dim differing from the stored table - dim forces a rebuild rather than a hard `add()` failure. -- **ANN threshold** trigger logic with a low test threshold: `maybe_create_ann_index` - attempts an index past the threshold and is a no-op below it; a `create_index` failure - is non-fatal and leaves exact search working. -- **ANN fallback on a non-divisible dim**: with an embedding dim not divisible by the PQ - `num_sub_vectors` (e.g. 1024), `maybe_create_ann_index` builds IVF_FLAT (or the - try/except fallback fires) and leaves the table queryable, not broken/unindexed. -- **Fresh-process filtering**: construct a brand-new `PaperlessLanceVectorStore` against an - existing on-disk table and assert an `IN` filter still returns the right rows — the - cross-restart path. -- **Retriever forwards filters**: assert `VectorIndexRetriever(filters=MetadataFilters(...))` - built on `VectorStoreIndex.from_vector_store(...)` actually scopes results — the - load-bearing integration seam for similar-docs and chat. -- **Compaction reclaims versions**: after several single-document writes, the maintenance - `optimize(cleanup_older_than=...)` call reduces the table to a single version and - results stay queryable afterward. 
-- **Upsert after optimize** (LanceDB #3177 guard): with a scalar index on `document_id` - (and none on `id`), an `upsert_document` performed _after_ `optimize()` still prunes and - replaces correctly — verified, but pinned with a test so a future index-placement change - or LanceDB regression is caught. -- Parametrize the add/update/remove variations rather than duplicating bodies. - -## Out of scope - -- Replacing llama-index for chunking, embeddings, or the chat query engine. -- Any DB-integrated (pgvector-style) path. -- Hybrid / full-text / reranked search modes offered by LanceDB (vector search only, - matching current behavior). -- Tuning embedding models or chunking parameters. - -## Open risks - -- **pyarrow/lancedb footprint.** `lancedb` + `pyarrow` (native wheels) enlarge the optional - AI feature's dependency tree; verify image-size impact when updating the lockfile. (Still - lighter than the wrapper path, which added `pandas` on top of these.) -- **ANN index parameters.** The IVF_PQ-vs-IVF_FLAT-by-divisibility logic (§2) plus the - best-effort/exact fallback contains the correctness risk, but the row threshold and - `num_partitions` heuristic should be validated on a large fixture for actual query - latency. -- **We own the adapter.** We depend on llama-index's `BasePydanticVectorStore` interface - and the `node_to_metadata_dict` / `metadata_dict_to_node` helpers. These are stable core - APIs (far more stable than the integration package), but a major llama-index bump should - re-run the end-to-end retriever test. Pin a known-good `lancedb` and `llama-index-core`. -- **`merge_insert` + scalar index on the match column (LanceDB #3177).** `merge_insert` can - fail _silently_ after `optimize()` if a scalar index exists on the match column. We match - on `id` and only index `document_id`, so we are clear — but this is an invariant to - enforce (never index `id`) and to cover with a test that exercises - upsert-after-optimize.
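The ANN index parameters risk above concerns the selection logic from §2. A minimal sketch of that logic follows, written as a pure function so it can be unit-tested without a table. `plan_ann_index`, `pick_num_sub_vectors`, the candidate list, and `MIN_PARTITIONS` are hypothetical; only the ~100K row threshold, the `num_rows / 4096` guidance, and the IVF_PQ divisibility rule come from the text.

```python
from typing import Optional

ANN_INDEX_MIN_ROWS = 100_000  # threshold named in the design
MIN_PARTITIONS = 8  # assumption: the "sane minimum" is not specified
SUB_VECTOR_CANDIDATES = (96, 64, 48, 32, 16, 8)  # assumption: preference order


def pick_num_sub_vectors(dim: int) -> Optional[int]:
    """Return the first candidate that evenly divides the embedding dim."""
    for candidate in SUB_VECTOR_CANDIDATES:
        if dim % candidate == 0:
            return candidate
    return None


def plan_ann_index(
    num_rows: int, dim: int, has_vector_index: bool = False
) -> Optional[dict]:
    """Decide which index parameters to request, or None for exact search."""
    if has_vector_index or num_rows <= ANN_INDEX_MIN_ROWS:
        return None
    num_partitions = max(MIN_PARTITIONS, num_rows // 4096)
    num_sub_vectors = pick_num_sub_vectors(dim)
    if num_sub_vectors is None:
        # No valid PQ split: fall back to IVF_FLAT, which has no divisor rule.
        return {"index_type": "IVF_FLAT", "num_partitions": num_partitions}
    return {
        "index_type": "IVF_PQ",
        "num_partitions": num_partitions,
        "num_sub_vectors": num_sub_vectors,
    }


plan_768 = plan_ann_index(num_rows=200_000, dim=768)
plan_1024 = plan_ann_index(num_rows=200_000, dim=1024)
plan_odd = plan_ann_index(num_rows=200_000, dim=1021)
```

In the adapter, the returned dict would be passed to `table.create_index(...)` inside the `try/except` described in §2, so a rejected parameter set still leaves the table on exact search.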