Design: Replace FAISS vector store with LanceDB (custom adapter)

Spec for swapping the AI feature's llama-index FAISS StorageContext trio (FaissVectorStore + SimpleDocumentStore + SimpleIndexStore) for LanceDB via a custom BasePydanticVectorStore adapter (no llama-index-vector-stores-lancedb, no pandas). Covers: disk-resident memory-mapped storage, native merge_insert upsert with when_not_matched_by_source_delete, MetadataFilters(IN) filtering on a top-level document_id column, auto IVF ANN threshold (IVF_FLAT fallback), MVCC compaction via optimize(cleanup_older_than=...), migration, concurrency, and testing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-07-31 08:05:59 +00:00 · 2026-06-03 09:41:03 -07:00
parent abdcdccf08
commit c9ee9edb95
1 changed files with 448 additions and 0 deletions
@@ -0,0 +1,448 @@
+# Replace the FAISS vector store with LanceDB
+
+**Date:** 2026-06-02
+**Status:** Design — pending implementation plan
+**Area:** `src/paperless_ai/` (AI / LLM index feature)
+
+## Problem
+
+The optional AI feature stores document embeddings in a llama-index `StorageContext`
+made of three file-backed components persisted under `DATA_DIR/llm_index/`:
+
+| Component                                | Role                                                 | Backing           |
+| ---------------------------------------- | ---------------------------------------------------- | ----------------- |
+| `FaissVectorStore` (`faiss.IndexFlatL2`) | the vectors                                          | binary faiss file |
+| `SimpleDocumentStore`                    | node text + metadata (source of truth for retrieval) | one large JSON    |
+| `SimpleIndexStore`                       | `vector_id → node_id` map                            | JSON              |
+
+`faiss.IndexFlatL2` is append-only and has no metadata filtering, and all three
+components are whole-file, load-everything-into-RAM structures. That combination —
+not FAISS alone — drives the bulk of the surrounding complexity and is what fails
+on large installs:
+
+1. **Deletes are fake.** On update/remove, `remove_document_docstore_nodes`
+   (`indexing.py:182`) deletes nodes from the _docstore_ only; the FAISS vectors
+   physically remain forever. The only way to truly reclaim them is a full
+   `rebuild=True` (re-embed every document).
+2. **No metadata filtering** forces the entire custom `DocumentFilteredFaissRetriever`
+   (`chat.py:78-151`) with its expanding `top_k *= 2` loop to emulate a
+   `document_id IN (...)` filter.
+3. **Whole-docstore Python scans.** `query_similar_documents` (`indexing.py:419`)
+   iterates the full docstore in Python to translate `document_id → node_id`.
+4. **Write amplification.** Every single-document add/update/remove takes a global
+   `FileLock` and calls `storage_context.persist()`, which rewrites the entire
+   multi-GB JSON docstore — O(N) memory and O(N) disk per document operation.
+5. **Brute-force query.** `IndexFlatL2` is O(N·d) per search with no ANN.
+
+We cannot predict or bound a user's install size, so the replacement must scale from
+a handful of documents to very large corpora on a single node, with no extra service.
+
+## Constraints (decided during brainstorming)
+
+- **Engine-agnostic, on-disk store.** Paperless supports SQLite, PostgreSQL _and_
+  MariaDB, so DB-integrated vectors (e.g. pgvector) are out — the vector store stays
+  a self-contained on-disk artifact like today's `llm_index` dir, identical across DB
+  backends.
+- **Swap the storage layer only.** Keep llama-index as the framework. `VectorStoreIndex`,
+  the retrievers, the chat query engine + response synthesizer, `SimpleNodeParser`, and
+  the embedding-model abstraction are all unchanged. Only the `StorageContext` trio is
+  replaced.
+- **Store: LanceDB**, integrated via a **custom `BasePydanticVectorStore` adapter** we
+  own (`PaperlessLanceVectorStore`) talking to `lancedb` + `pyarrow` directly — _not_ the
+  official `llama-index-vector-stores-lancedb` wrapper. The wrapper was evaluated and
+  rejected: it hard-requires `pandas`, hides `index_type` behind `**kwargs`, and _raises_
+  on empty query results. A ~150-180 line adapter against llama-index's stable public
+  interfaces avoids all three and lets us own the table schema. (See "Why a custom
+  adapter".)
+- **ANN: auto threshold.** Small installs use LanceDB's exact (brute-force) kNN, which
+  LanceDB's own docs call sufficient for datasets up to ~100K vectors. Past a threshold
+  we build an IVF index automatically, best-effort, with exact search as the
+  always-valid fallback.
+- **pandas is eliminated.** `llama-index-core` does not depend on pandas, and the custom
+  adapter materializes LanceDB results via `pyarrow` (`.to_list()`), so pandas never
+  enters the dependency tree. `pyarrow` is a direct dep but arrives transitively through
+  `lancedb` regardless.
+
+## Why LanceDB
+
+LanceDB is the only embedded, serverless candidate architected for **disk-resident,
+memory-mapped** operation — RAM does not scale with the corpus, which is the single
+most important property for "tiny or very large, equally." It provides real CRUD
+(predicate `delete`, `add`), filtered search, and IVF ANN, all writing to a directory on
+disk. Because our adapter declares `stores_text = True`, llama-index runs off the vector
+store alone — so both `SimpleDocumentStore` and `SimpleIndexStore` are deleted outright.
+
+Verified against `lancedb 0.33.0` with functional probes:
+
+- A `lancedb` table on disk is memory-mapped; writes are durable on call (connect = a
+  directory, table = a Lance dataset). **No `persist()` and no whole-file rewrite.**
+- `table.delete('doc_id = "..."')` is a real predicate delete that physically removes
+  rows (probe: a 2-chunk doc dropped to 0 rows).
+- `table.add(rows)` appends; `merge_insert(...).when_not_matched_by_source_delete(...)`
+  provides an atomic upsert that also prunes stale chunks — the incremental update path
+  (see §3). Verified: a doc going 5→3 chunks ends with exactly the 3 new chunks.
+- `table.search(embedding).where('document_id IN (...)').limit(k).to_list()` returns
+  plain dicts via `pyarrow` (**no pandas**), and returns `[]` cleanly on no match — **no
+  raise**.
+
+## Why a custom adapter (not `llama-index-vector-stores-lancedb`)
+
+The custom adapter was proven end-to-end through llama-index's real
+`VectorStoreIndex` → `VectorIndexRetriever` path with a `MockEmbedding`: build, update
+(delete+insert with **zero orphan rows**), `MetadataFilters(IN)` forwarded through the
+retriever, empty-filter → `[]`, and remove — all with **`pandas` never imported**. The
+adapter is ~120 lines in the probe (≈150-180 production-ready) and uses only llama-index's
+**stable public** primitives: `BasePydanticVectorStore`, `node_to_metadata_dict` /
+`metadata_dict_to_node`, `VectorStoreQuery` / `VectorStoreQueryResult`.
+
+Choosing the adapter over the wrapper converts several wrapper-specific liabilities into
+non-issues:
+
+| Wrapper liability                                                                       | With the custom adapter                                                                            |
+| --------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
+| Hard-imports `pandas` (`base.py:33`), uses `.to_pandas()`                               | Eliminated: results materialize via `.to_list()`                                                   |
+| `create_index` hides `index_type`, hard-defaults `num_sub_vectors=96` (base.py:333-368) | We call `table.create_index(...)` with explicit `index_type` / partitions / sub-vectors            |
+| `query()` _raises_ `Warning` on empty results (base.py:560-563)                         | Our `query()` returns an empty `VectorStoreQueryResult` natively                                   |
+| `_to_lance_filter` prefixes `metadata.<key>`; fragile when `_metadata_keys is None`     | Dedicated top-level `document_id` column; filter is plain `document_id IN (...)`, scalar-indexable |
+| Third-party package to pin and track for API drift                                      | No integration package; depend only on stable llama-index core interfaces                          |
+
+The cost is ~150-180 lines we own and test (vs. a ~10-line subclass) — but we were
+already subclassing to swallow the empty-result `Warning` and add the ANN threshold, so
+the net additional code is modest and removes a dependency.
+
+## Design
+
+### 1. Storage layer
+
+Replace `get_or_create_storage_context()` with a vector-store factory that returns a
+`PaperlessLanceVectorStore` pointed at `settings.LLM_INDEX_DIR` with an **explicit,
+pinned `table_name`** (e.g. `LLM_INDEX_TABLE = "documents"`) used consistently by the
+factory, the existence check (§7), and the migration detection (§8). The index is built
+with `VectorStoreIndex.from_vector_store(vector_store, embed_model=...)` for the
+load/query path, and `VectorStoreIndex(nodes=..., storage_context=...)` (storage context
+holding only the vector store) for the rebuild path. No docstore, no index store.
+
+`meta.json` (embedding model name + dimension) is **kept** for embedding-model-change
+detection that forces a rebuild — unchanged from today (`embedding.py:get_embedding_dim`).
+
+### 2. `PaperlessLanceVectorStore(BasePydanticVectorStore)`
+
+A custom adapter (~150-180 lines) implementing llama-index's vector-store contract
+directly against `lancedb` + `pyarrow`. Class flags: `stores_text = True`,
+`flat_metadata = True`.
+
+**Table schema** (explicit `pyarrow` schema, created lazily on first `add`):
+
+| Column         | Type                            | Purpose                                                           |
+| -------------- | ------------------------------- | ----------------------------------------------------------------- |
+| `id`           | `string`                        | node id (`node.node_id`)                                          |
+| `doc_id`       | `string`                        | `node.ref_doc_id` (= `str(document.id)`, see §3) — the delete key |
+| `document_id`  | `string`                        | top-level filter column (mirrors `metadata["document_id"]`)       |
+| `vector`       | `fixed_size_list<float32>[dim]` | embedding                                                         |
+| `node_content` | `string`                        | `json.dumps(node_to_metadata_dict(node, remove_text=False))`      |
+
+A dedicated top-level `document_id` column (rather than the wrapper's nested
+`metadata.<key>` struct) makes filtering a plain `document_id IN (...)` predicate and
+allows an optional LanceDB **scalar index** on it for fast filtered scans.
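
As a rough sketch of how one chunk could map onto the five columns above: `ChunkNode` and `node_to_row` are hypothetical names used only for illustration, and the `json.dumps` payload merely stands in for the real `node_to_metadata_dict` output, which is not reproduced here.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ChunkNode:
    """Hypothetical stand-in for a llama-index text node (not the real class)."""

    node_id: str
    ref_doc_id: str
    text: str
    embedding: list
    metadata: dict = field(default_factory=dict)


def node_to_row(node: ChunkNode, dim: int) -> dict:
    """Flatten one chunk into the five columns of the table schema."""
    if len(node.embedding) != dim:
        raise ValueError(f"expected a {dim}-dim embedding, got {len(node.embedding)}")
    return {
        "id": node.node_id,
        "doc_id": node.ref_doc_id,
        # Mirrored out of the metadata so filters never need a nested path.
        "document_id": str(node.metadata["document_id"]),
        "vector": [float(x) for x in node.embedding],
        # Placeholder for json.dumps(node_to_metadata_dict(node, remove_text=False)).
        "node_content": json.dumps({"text": node.text, "metadata": node.metadata}),
    }


row = node_to_row(
    ChunkNode("n-1", "42", "hello", [0.1, 0.2, 0.3], {"document_id": 42, "title": "Invoice"}),
    dim=3,
)
```

Checking the embedding length before the write is one way to surface a dimension problem as a clear Python error instead of a storage-layer failure.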
+
+**Methods:**
+
+- `add(nodes)` — serialize each node via `node_to_metadata_dict(node, remove_text=False,
+flat_metadata=True)` into the schema above; lazily `create_table` (with the explicit
+  schema sized to the embedding dim) or `table.add(rows)` (plain append). Used by the
+  **rebuild** path (bulk insert into a fresh table) and as llama-index's `add` hook.
+  Returns node ids.
+- `upsert_document(document_id, nodes)` — the **incremental** add/update path. A single
+  `merge_insert("id").when_matched_update_all().when_not_matched_insert_all()
+.when_not_matched_by_source_delete("document_id = '<id>'").execute(rows)` — atomic
+  replace-with-prune for one document (see §3). All nodes passed must belong to the one
+  `document_id`. Nodes are embedded before the call (the incremental path embeds with the
+  configured `embed_model` rather than going through `index.insert_nodes`).
+- `delete(ref_doc_id)` — `table.delete(f'doc_id = "{ref_doc_id}"')` (parameter-escaped).
+  Used for document removal.
+- `delete_nodes(node_ids)` — `table.delete('id IN (...)')` (for completeness).
+- `get_nodes(node_ids=None, filters=None)` — `table.search().where(...).to_list()`,
+  rebuild nodes via `metadata_dict_to_node(json.loads(row["node_content"]))`. Returns `[]`
+  cleanly when empty — the correct primitive for the chat no-content pre-check.
+- `query(VectorStoreQuery)` — `table.search(query.query_embedding).where(_build_where(
+query.filters)).limit(query.similarity_top_k).to_list()`; rebuild nodes, map LanceDB L2
+  `_distance` → a similarity score, and **return an empty `VectorStoreQueryResult` on no
+  match (no raise)**.
+- `client` property → the `lancedb` connection.
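
The list above does not pin down how an L2 `_distance` becomes a similarity score. The sketch below assumes `1 / (1 + distance)`, which is only one possible mapping, and uses a hypothetical `QueryResult` dataclass in place of llama-index's `VectorStoreQueryResult`. It also shows the no-match case returning an empty result rather than raising.

```python
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    """Hypothetical stand-in for llama-index's VectorStoreQueryResult."""

    ids: list = field(default_factory=list)
    similarities: list = field(default_factory=list)


def distance_to_similarity(distance: float) -> float:
    """Assumed mapping: L2 distance in [0, inf) to a score in (0, 1]."""
    return 1.0 / (1.0 + max(distance, 0.0))


def rows_to_result(rows: list) -> QueryResult:
    """Build a result from search rows; no rows gives an empty result, not an error."""
    result = QueryResult()
    for row in rows:
        result.ids.append(row["id"])
        result.similarities.append(distance_to_similarity(row["_distance"]))
    return result


hits = rows_to_result([{"id": "a", "_distance": 0.0}, {"id": "b", "_distance": 3.0}])
empty = rows_to_result([])
```

Any monotonically decreasing mapping would preserve the ranking; the exact formula only matters if callers compare scores against a fixed cutoff.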
+
+**Filter translation** — `_build_where(MetadataFilters)` handles exactly the operators we
+use (`EQ`, `IN`) on the top-level `document_id` column, string-escaping values. This is
+small, fully owned, and free of the wrapper's `metadata.`-prefix / `_metadata_keys`
+behavior.
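
A minimal sketch of such a translator, under stated assumptions: `Op` and `Filter` are hypothetical stand-ins for llama-index's `FilterOperator` and `MetadataFilter`, values are escaped as single-quoted SQL string literals, and an empty `IN` list is rendered as an always-false clause. None of these choices is confirmed by the probe.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Op(Enum):
    """Hypothetical stand-in for llama-index's FilterOperator."""

    EQ = "=="
    IN = "in"


@dataclass
class Filter:
    """Hypothetical stand-in for llama-index's MetadataFilter."""

    key: str
    value: object
    operator: Op = Op.EQ


def _quote(value: object) -> str:
    # SQL string literal: wrap in single quotes and double any embedded quote.
    return "'" + str(value).replace("'", "''") + "'"


def build_where(filters: list) -> Optional[str]:
    """Translate EQ / IN filters on document_id into a predicate string."""
    if not filters:
        return None  # unconstrained search: no WHERE clause at all
    clauses = []
    for f in filters:
        if f.key != "document_id":
            raise ValueError(f"unsupported filter key: {f.key}")
        if f.operator is Op.EQ:
            clauses.append(f"document_id = {_quote(f.value)}")
        elif f.operator is Op.IN:
            values = list(f.value)
            if values:
                joined = ", ".join(_quote(v) for v in values)
                clauses.append(f"document_id IN ({joined})")
            else:
                clauses.append("1 = 0")  # an empty IN list can match nothing
        else:
            raise ValueError(f"unsupported operator: {f.operator}")
    return " AND ".join(clauses)


where = build_where([Filter("document_id", ["1", "2"], Op.IN)])
```

Rejecting unknown keys and operators outright keeps the translator honest: a filter it cannot express fails loudly instead of silently widening the result set.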
+
+**Auto ANN index** — `maybe_create_ann_index()`, called after build/update writes **while
+holding the global write lock** (it is itself a write path): if the table row count
+exceeds `ANN_INDEX_MIN_ROWS` (~100K chunks, per LanceDB guidance) and no vector index
+exists yet, best-effort `table.create_index(...)`:
+
+- **Index type by divisibility.** IVF_PQ requires `num_sub_vectors` to _evenly divide_
+  the embedding dimension — LanceDB raises a hard `RuntimeError` otherwise (verified). The
+  dimension is detected at runtime from a user-configurable model and many common dims
+  (e.g. 1024) are **not** divisible by 96. So: pick a `num_sub_vectors` that divides the
+  dim and build **IVF_PQ**; if none exists, build **IVF_FLAT** (`index_type="IVF_FLAT"`),
+  which has no divisor constraint and still gives IVF/ANN speedup — strictly better than
+  reverting to full brute-force. (Talking to `lancedb` directly, `index_type` is just a
+  named argument — none of the wrapper's kwargs-smuggling.)
+- `num_partitions`: LanceDB guidance is ≈ `num_rows / 4096`; clamp to a sane minimum.
+- Wrapped in `try/except` — a failure logs and leaves the table on exact search, which is
+  always correct.
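
The parameter selection described in the bullets above could be sketched as a pure function. The candidate list for `num_sub_vectors`, the minimum partition count, and the function names are assumptions made for illustration; only the row threshold, the divisibility rule, and the `num_rows / 4096` heuristic come from the text.

```python
from typing import Optional

ANN_INDEX_MIN_ROWS = 100_000  # threshold from the text
MIN_PARTITIONS = 16  # assumed "sane minimum"; not specified in the text
SUB_VECTOR_CANDIDATES = (96, 64, 48, 32, 16, 8)  # assumed preference order


def pick_num_sub_vectors(dim: int) -> Optional[int]:
    """Return the first candidate that evenly divides the embedding dim, if any."""
    for candidate in SUB_VECTOR_CANDIDATES:
        if candidate <= dim and dim % candidate == 0:
            return candidate
    return None


def plan_ann_index(num_rows: int, dim: int, has_index: bool = False) -> Optional[dict]:
    """Return keyword arguments for an index build, or None to stay on exact search."""
    if has_index or num_rows <= ANN_INDEX_MIN_ROWS:
        return None
    plan = {"num_partitions": max(num_rows // 4096, MIN_PARTITIONS)}
    num_sub_vectors = pick_num_sub_vectors(dim)
    if num_sub_vectors is None:
        plan["index_type"] = "IVF_FLAT"  # no divisor constraint
    else:
        plan["index_type"] = "IVF_PQ"
        plan["num_sub_vectors"] = num_sub_vectors
    return plan


plan_768 = plan_ann_index(200_000, 768)
plan_1024 = plan_ann_index(200_000, 1024)
plan_odd = plan_ann_index(200_000, 1021)
```

Keeping the decision separate from the actual `create_index` call makes the threshold and fallback logic testable without building a large table.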
+
+### 3. Node identity
+
+In `build_document_node` (`indexing.py:109`), set the `LlamaDocument` `id_` to
+`str(document.id)`. `SimpleNodeParser` propagates that as each chunk node's
+`ref_doc_id`, and the adapter stores it in the `doc_id` column. Result: every chunk of a
+paperless document shares `ref_doc_id == str(document.id)`, so one `delete(str(doc.id))`
+clears exactly that document's chunks (verified end-to-end). `document_id` also remains in
+node metadata (and is mirrored to the top-level filter column) for filtering and result
+mapping.
+
+**Update = native upsert via `merge_insert` (one atomic commit).** The incremental
+add/update path uses a single `merge_insert`, not delete-then-add:
+
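+```python
+# One atomic commit: update matching chunks, insert new ones, prune stale ones.
+(
+    table.merge_insert("id")
+    .when_matched_update_all()
+    .when_not_matched_insert_all()
+    # Scoped to this document, so rows of other documents are never deleted.
+    .when_not_matched_by_source_delete(f"document_id = '{document_id}'")
+    .execute(new_rows)
+)
+```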
+
+The `when_not_matched_by_source_delete` clause — scoped to the document's `document_id`
+— prunes stale trailing chunks (the case where an edit reduces a document's chunk count)
+**atomically in the same commit**. Verified on 0.33.0: a doc going 5→3 chunks ends with
+exactly the 3 new chunks, other documents untouched, and it works whether or not chunk
+ids are deterministic (non-matching ids become a full replace).
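
To make the pruning semantics concrete, here is a small in-memory model of the three clauses. It is not LanceDB code: the table is a plain dict keyed by `id`, and `merge_insert_with_prune` and `make_rows` are hypothetical helpers.

```python
def merge_insert_with_prune(table: dict, new_rows: list, document_id: str) -> dict:
    """Model of match-on-id upsert plus a scoped not-matched-by-source delete."""
    incoming = {row["id"]: row for row in new_rows}
    merged = {}
    for row_id, row in table.items():
        if row_id in incoming:
            continue  # matched: replaced by the incoming row below
        if row["document_id"] == document_id:
            continue  # not matched by source, inside the delete predicate: pruned
        merged[row_id] = row  # rows of other documents pass through untouched
    merged.update(incoming)  # updates and inserts land together
    return merged  # a new dict models the single new table version


def make_rows(document_id: str, count: int, prefix: str) -> list:
    """Build placeholder chunk rows for one document."""
    return [{"id": f"{prefix}-{i}", "document_id": document_id} for i in range(count)]


# Document 7 shrinks from 5 chunks to 3; document 8 is unrelated.
before = {r["id"]: r for r in make_rows("7", 5, "old") + make_rows("8", 2, "other")}
after = merge_insert_with_prune(before, make_rows("7", 3, "new"), "7")
doc7 = sorted(i for i, r in after.items() if r["document_id"] == "7")
doc8 = sorted(i for i, r in after.items() if r["document_id"] == "8")
```

Because the new ids here do not overlap the old ones, the run exercises the full-replace case; with deterministic ids the same function updates the surviving chunks in place and prunes only the trailing ones.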
+
+This is strictly better than delete-then-add on three axes:
+
+- **Atomicity / no transient empty state.** Queries take no lock (§6), so delete-then-add
+  exposes a window between the delete commit and the add commit in which a concurrent
+  reader sees the document with **zero chunks**. A single `merge_insert` commit eliminates
+  that window — a reader sees either the old or the new chunk set.
+- **Half the version growth.** One commit per update instead of two, directly halving the
+  MVCC version accumulation that compaction (§10) must reclaim.
+- **Correctness preserved.** Stale chunks are still pruned, with no separate delete call.
+
+> **Important:** `optimize()` prunes old _versions_, **not** dead _rows_ in the live
+> version. A plain upsert (update+insert without the delete clause) would leave stale
+> chunks as live rows that `optimize` can never remove — so the
+> `when_not_matched_by_source_delete` clause is mandatory, not optional.
+
+> **Index caveat (LanceDB #3177):** `merge_insert` can fail _silently_ after `optimize()`
+> when a scalar index exists on the **match column**. We match on `id`, so a scalar index
+> must **never** be created on `id`. The optional scalar index for filtering goes on
+> `document_id` only (§2), which is not the match column.
+
+### 4. The four operations collapse
+
+| Operation    | Before                                                                                                      | After                                                                                   |
+| ------------ | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
+| add / update | load whole index → `remove_document_docstore_nodes` (fake delete) → `insert_nodes` → `persist` (rewrite GB) | `store.upsert_document(str(doc.id), embedded_nodes)` (one atomic `merge_insert` commit) |
+| remove       | load → docstore delete → persist                                                                            | `store.delete(str(doc.id))`                                                             |
+| similar      | load whole docstore, Python scan for node ids, custom retriever                                             | `VectorIndexRetriever(index, similarity_top_k=k, filters=document_id IN allowed)`       |
+| chat         | custom `DocumentFilteredFaissRetriever` (74 lines)                                                          | stock `VectorIndexRetriever(filters=document_id IN doc_ids)`                            |
+
+Deleted entirely: `remove_document_docstore_nodes`, the whole-docstore scan in
+`query_similar_documents`, and `_get_document_filtered_retriever` /
+`DocumentFilteredFaissRetriever` in `chat.py`. `chat.py`'s direct reaches into
+`index.docstore.docs`, `index.vector_store._faiss_index`, and
+`index.index_struct.nodes_dict` all disappear. The "no content" pre-check in
+`_stream_chat_with_documents` becomes a `store.get_nodes(filters=...)` existence check —
+the adapter's `get_nodes` returns `[]` cleanly on no match, so it is the correct
+primitive for an existence test. References are still derived from returned nodes'
+`metadata["document_id"]` / `metadata["title"]`, so `_get_document_references` is
+unchanged.
+
+Because the adapter's `query()` returns an empty result on no match (it never raises),
+both similar-docs and chat — which retrieve through `VectorIndexRetriever` /
+`RetrieverQueryEngine` calling `vector_store.query()` internally — get a clean empty
+result set instead of an exception. This was the wrapper's most disruptive wart and is
+designed out, not worked around.
+
+### 5. Filtering
+
+Both similar-document and chat retrieval pass a `MetadataFilters` with a single
+`MetadataFilter(key="document_id", operator=FilterOperator.IN, value=[...])` (omitted
+when unconstrained). The adapter's `_build_where` translates this to the plain predicate
+`document_id IN ("...","...")` against the top-level `document_id` column — no struct
+path, no `_metadata_keys` dependence, so filtering is unconditionally correct on a freshly
+opened table and across process restarts (proven by the fresh-process probe).
+
+This replaces today's `query_similar_documents` mechanism, which pre-scans the docstore
+for node ids and passes `doc_ids=` to `VectorIndexRetriever` (`indexing.py:416-434`) — a
+_different_ retriever mechanism. The new path relies on
+`VectorIndexRetriever(filters=MetadataFilters(...))` forwarding `query.filters` into
+`vector_store.query()` — **verified end-to-end in the probe** (a retriever with an `IN`
+filter returned only the matching document). Still covered by a regression test (see
+Testing) since it is load-bearing for both similar-docs and chat.
+
+### 6. Concurrency
+
+Keep the existing `FileLock(_index_lock_path())` around **writes** only. Each write is now
+a small delta append/upsert instead of a multi-GB rewrite, so the lock is held briefly.
+Queries take no lock (LanceDB reads are MVCC snapshot-consistent). The lock-free read path
+is safe for updates specifically because the incremental update is a single atomic
+`merge_insert` commit (§3) — a reader never observes a document mid-update. The lock still
+serializes _writers_ across Celery processes to avoid `CommitConflictError`.
+
+**Why the global lock is load-bearing.** LanceDB's MVCC tolerates concurrent _appends_,
+but concurrent _delete/update_ operations frequently conflict and fail with
+`CommitConflictError` after exhausting retries (LanceDB issues #1597, #3086). Paperless's
+add/update path is a `merge_insert` that updates and deletes rows (§3), and its remove
+path is a predicate delete; both run from **separate Celery worker processes**. The
+design is safe only because `_index_lock_path()` is a single shared lock
+file under `LLM_INDEX_DIR` that serializes _all_ writers. This lock must:
+
+- remain a single global lock (do **not** relax to per-document granularity), and
+- cover every write path — add, update, remove, **and** `maybe_create_ann_index()`.
+
+### 7. Index existence / rebuild trigger
+
+Replace `vector_store_file_exists()` with a check for the LanceDB table's existence
+(`LLM_INDEX_DIR` present and the pinned `LLM_INDEX_TABLE` in `connection.table_names()`).
+The existing `queue_llm_index_update_if_needed` / `load_or_build_index` rebuild-on-missing
+logic is otherwise unchanged.
+
+**Dimension-mismatch guard.** The Lance table's vector column dimension is fixed at the
+first `add()`. Beyond the `meta.json` model-change detection (which forces a rebuild when
+the _model name_ changes), guard against a dimension mismatch directly: if the current
+embedding dim differs from the existing table's vector dim, force a rebuild rather than
+letting `add()` fail with a hard dimension error. This covers the gaps `meta.json` can't —
+a missing/corrupt `meta.json`, or two models sharing a name but differing in dim.
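
One way the two checks could combine, as a sketch: `rebuild_reason` and the `embedding_model` key are hypothetical, and a missing table is assumed to be handled by the existing rebuild-on-missing logic rather than here.

```python
from typing import Optional


def rebuild_reason(
    current_model: str,
    current_dim: int,
    meta: Optional[dict],
    table_dim: Optional[int],
) -> Optional[str]:
    """Return why a rebuild is needed, or None when the stored index is usable."""
    # Existing behavior: a recorded model name that differs forces a rebuild.
    if meta is not None and meta.get("embedding_model") != current_model:
        return "model-changed"
    # New guard: compare against the table's actual vector width, which still
    # works when meta.json is missing, corrupt, or names a same-named model.
    if table_dim is not None and table_dim != current_dim:
        return "dimension-mismatch"
    return None


same_name_new_dim = rebuild_reason("embed-x", 1024, {"embedding_model": "embed-x"}, 768)
missing_meta = rebuild_reason("embed-x", 1024, None, 768)
healthy = rebuild_reason("embed-x", 768, {"embedding_model": "embed-x"}, 768)
```

Returning a reason string rather than a bare boolean makes the rebuild trigger easy to log.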
+
+### 8. Migration
+
+The index is fully derived data, rebuildable from `Document` rows. On first run of the
+new code, detect the stale FAISS format (presence of `default__vector_store.json` /
+faiss files with no LanceDB table), wipe `LLM_INDEX_DIR`, and trigger a rebuild through
+the existing `queue_llm_index_update_if_needed(rebuild=...)` path. No data migration and
+no user action beyond the automatic background rebuild.
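
The detection step might look like the sketch below. It assumes that a local LanceDB table appears on disk as a `<table_name>.lance` directory, which should be confirmed against the pinned `lancedb` version, and it checks only the `default__vector_store.json` marker named above. The helper names are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path

LLM_INDEX_TABLE = "documents"  # the pinned table name from the storage-layer section


def is_stale_faiss_index(index_dir: Path, table_name: str = LLM_INDEX_TABLE) -> bool:
    """True when legacy FAISS files are present and no LanceDB table exists."""
    if not index_dir.is_dir():
        return False
    has_faiss = (index_dir / "default__vector_store.json").exists()
    # Assumption: the table lives in a "<name>.lance" directory.
    has_lance = (index_dir / f"{table_name}.lance").is_dir()
    return has_faiss and not has_lance


def wipe_index_dir(index_dir: Path) -> None:
    """Remove everything under the index dir and recreate it empty."""
    shutil.rmtree(index_dir, ignore_errors=True)
    index_dir.mkdir(parents=True, exist_ok=True)


with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp) / "llm_index"
    index_dir.mkdir()
    (index_dir / "default__vector_store.json").write_text("{}")
    stale_before = is_stale_faiss_index(index_dir)
    wipe_index_dir(index_dir)
    stale_after = is_stale_faiss_index(index_dir)
    dir_survives = index_dir.is_dir()
```

After the wipe, the rebuild itself would still go through the existing `queue_llm_index_update_if_needed` path; nothing here triggers it.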
+
+### 9. Dependencies (`pyproject.toml`)
+
+- **Remove:** `faiss-cpu`, `llama-index-vector-stores-faiss`.
+- **Add:** `lancedb` (pulls in `pyarrow`, `numpy`, `pydantic`, `tqdm`) and `pyarrow`
+  (declared directly since the adapter imports it, even though `lancedb` pulls it
+  transitively). **No `llama-index-vector-stores-lancedb`, no `pandas`** — `llama-index-core`
+  does not require pandas (verified) and the adapter materializes results via `.to_list()`.
+- Confirm multi-arch wheels (linux x86_64 + aarch64, the paperless Docker targets) for
+  `lancedb`/`pyarrow` resolve in the lockfile. (`lancedb 0.33.0` ships manylinux x86_64 +
+  aarch64 wheels, matching the paperless Docker build matrix.)
+
+### 10. Maintenance / compaction — **required, not optional**
+
+MVCC has a real disk cost that this design must actively manage. LanceDB writes a **new
+fragment + version on every `add`/`delete`** and retains the superseded files until
+cleanup. Paperless adds/updates documents **one at a time**, so the store bloats
+continuously without maintenance. Measured on 2000 × 768-dim vectors (raw float32 =
+6000 KiB):
+
+| Scenario                                                | On disk                | Versions |
+| ------------------------------------------------------- | ---------------------- | -------- |
+| One bulk insert (= a rebuild)                           | 6016 KiB               | 1        |
+| 2000 single-row adds (= per-document writes)            | **172,848 KiB (~28×)** | 2001     |
+| After `table.optimize(cleanup_older_than=timedelta(0))` | **6344 KiB**           | 1        |
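
The figures in the table can be cross-checked with a few lines of arithmetic. The inputs are the measured values above; the derived per-commit overhead is a back-of-the-envelope estimate, not a separate measurement.

```python
NUM_VECTORS = 2000
DIM = 768
BYTES_PER_FLOAT32 = 4

# Raw payload: 2000 vectors x 768 dims x 4 bytes, expressed in KiB.
raw_kib = NUM_VECTORS * DIM * BYTES_PER_FLOAT32 / 1024

# Measured on-disk sizes from the table, in KiB.
bulk_kib = 6016
single_row_kib = 172_848
compacted_kib = 6344

# Growth of per-document writes relative to one bulk insert.
amplification = single_row_kib / bulk_kib

# Average extra bytes each single-row commit leaves behind, in KiB.
overhead_per_commit_kib = (single_row_kib - bulk_kib) / NUM_VECTORS

# Size after compaction relative to the raw vector bytes.
compacted_ratio = compacted_kib / raw_kib
```

Each row carries only 3 KiB of vector data, so nearly all of the roughly 83 KiB left behind per commit is per-version and per-fragment overhead, which is what compaction reclaims.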
+
+Implications:
+
+- **Full rebuilds are naturally compact** (bulk insert ≈ raw vector bytes), so a rebuild
+  resets accumulated bloat.
+- **The atomic upsert (§3) halves _update_ version growth** (one commit instead of
+  delete-then-add's two), but every new-document insert is still its own version, so
+  versions accumulate over time regardless — compaction remains required.
+- **Per-document writes must be compacted periodically.** Run
+  `table.optimize(cleanup_older_than=<retention>)` — a **single call** that compacts
+  fragments _and_ drops old versions — folded into the existing scheduled LLM-index
+  maintenance task, under the global write lock. Use a small but non-zero retention in
+  production (e.g. minutes to hours) so an in-flight reader does not have the version it
+  is reading deleted underneath it; `timedelta(0)` is for tests/rebuild-time only.
+- **Do not use the older `cleanup_old_versions()`** API: it requires the separate
+  `pylance` package (not pulled by `lancedb` core) and is superseded by
+  `optimize(cleanup_older_than=...)`.
+
+**On the "larger on disk than FAISS" observation:** at small scale LanceDB stores vectors
+as **raw `float32`** (identical per-vector bytes to FAISS `IndexFlatL2`); vector
+_compression_ only comes from the IVF_PQ index, which only exists past the ANN threshold
+(§2). So a small dataset is expected to be _comparable to, not smaller than_, FAISS, and any
+large discrepancy is version accumulation, fixed by the compaction above.
+
+> **Windows note:** the probe hit an `Access is denied` error writing a version-hint file
+> during cleanup on Windows (temp-dir file locking). Paperless production is Linux
+> containers, so this does not affect the deployment target, but bare-metal Windows dev
+> installs may need attention.
+
+## Testing
+
+Per project conventions (pytest-style, classes with `@pytest.mark.django_db`,
+pytest-mock, factory-boy, type-annotated fixtures/tests, default config). LanceDB writes
+to a real directory, so tests point `settings.LLM_INDEX_DIR` at `tmp_path` and exercise a
+**real** (tiny) LanceDB table with a stub embedding model returning deterministic vectors
+— no mocking of store internals.
+
+- **add → query** returns the document.
+- **update** via `upsert_document` leaves no orphan rows — re-index a document whose chunk
+  count _shrinks_ (e.g. 5→3) and assert exactly the new chunks remain and other documents
+  are untouched (this is the guarantee the old fake delete could not provide, and proves
+  `when_not_matched_by_source_delete` prunes stale chunks).
+- **update is one commit** — assert the table version advances by exactly 1 per
+  `upsert_document` (guards the atomicity / version-growth property).
+- **remove** drops all of a document's chunks.
+- **filtered query** scopes results to the given `document_id`s and excludes others.
+- **empty query** returns `[]` (the adapter's `query()` never raises).
+- **node round-trip**: a node serialized via `node_to_metadata_dict` and reconstructed via
+  `metadata_dict_to_node` preserves text + metadata (`document_id`, `title`).
+- **embedding-model change** → `meta.json` mismatch forces rebuild (existing behavior).
+- **dimension-mismatch guard** → a current embedding dim differing from the stored table
+  dim forces a rebuild rather than a hard `add()` failure.
+- **ANN threshold** trigger logic with a low test threshold: `maybe_create_ann_index`
+  attempts an index past the threshold and is a no-op below it; a `create_index` failure
+  is non-fatal and leaves exact search working.
+- **ANN fallback on a non-divisible dim**: with an embedding dim not divisible by the PQ
+  `num_sub_vectors` (e.g. 1024), `maybe_create_ann_index` builds IVF_FLAT (or the
+  try/except fallback fires) and leaves the table queryable, not broken/unindexed.
+- **Fresh-process filtering**: construct a brand-new `PaperlessLanceVectorStore` against an
+  existing on-disk table and assert an `IN` filter still returns the right rows — the
+  cross-restart path.
+- **Retriever forwards filters**: assert `VectorIndexRetriever(filters=MetadataFilters(...))`
+  built on `VectorStoreIndex.from_vector_store(...)` actually scopes results — the
+  load-bearing integration seam for similar-docs and chat.
+- **Compaction reclaims versions**: after several single-document writes, the maintenance
+  `optimize(cleanup_older_than=...)` call reduces the table to a single version and
+  results stay queryable afterward.
+- **Upsert after optimize** (LanceDB #3177 guard): with a scalar index on `document_id`
+  (and none on `id`), an `upsert_document` performed _after_ `optimize()` still prunes and
+  replaces correctly — verified, but pinned with a test so a future index-placement change
+  or LanceDB regression is caught.
+- Parametrize the add/update/remove variations rather than duplicating bodies.
+
+## Out of scope
+
+- Replacing llama-index for chunking, embeddings, or the chat query engine.
+- Any DB-integrated (pgvector-style) path.
+- Hybrid / full-text / reranked search modes offered by LanceDB (vector search only,
+  matching current behavior).
+- Tuning embedding models or chunking parameters.
+
+## Open risks
+
+- **pyarrow/lancedb footprint.** `lancedb` + `pyarrow` (native wheels) enlarge the optional
+  AI feature's dependency tree; verify image-size impact when updating the lockfile. (Still
+  lighter than the wrapper path, which added `pandas` on top of these.)
+- **ANN index parameters.** The IVF_PQ-vs-IVF_FLAT-by-divisibility logic (§2) plus the
+  best-effort/exact fallback contains the correctness risk, but the row threshold and
+  `num_partitions` heuristic should be validated on a large fixture for actual query
+  latency.
+- **We own the adapter.** We depend on llama-index's `BasePydanticVectorStore` interface
+  and the `node_to_metadata_dict` / `metadata_dict_to_node` helpers. These are stable core
+  APIs (far more stable than the integration package), but a major llama-index bump should
+  re-run the end-to-end retriever test. Pin a known-good `lancedb` and `llama-index-core`.
+- **`merge_insert` + scalar index on the match column (LanceDB #3177).** `merge_insert` can
+  fail _silently_ after `optimize()` if a scalar index exists on the match column. We match
+  on `id` and only index `document_id`, so we are clear — but this is an invariant to
+  enforce (never index `id`) and to cover with a test that exercises
+  upsert-after-optimize.