Removes the spec and plan files

2026-08-02 17:12:18 +00:00 · 2026-06-05 11:43:42 -07:00
parent ca6dca0efe
commit 60faa3f20f
2 changed files with 0 additions and 2169 deletions
@@ -1,448 +0,0 @@
-# Replace the FAISS vector store with LanceDB
-
-**Date:** 2026-06-02
-**Status:** Design — pending implementation plan
-**Area:** `src/paperless_ai/` (AI / LLM index feature)
-
-## Problem
-
-The optional AI feature stores document embeddings in a llama-index `StorageContext`
-made of three file-backed components persisted under `DATA_DIR/llm_index/`:
-
-| Component                                | Role                                                 | Backing           |
-| ---------------------------------------- | ---------------------------------------------------- | ----------------- |
-| `FaissVectorStore` (`faiss.IndexFlatL2`) | the vectors                                          | binary faiss file |
-| `SimpleDocumentStore`                    | node text + metadata (source of truth for retrieval) | one large JSON    |
-| `SimpleIndexStore`                       | `vector_id → node_id` map                            | JSON              |
-
-`faiss.IndexFlatL2` is append-only and has no metadata filtering, and all three
-components are whole-file, load-everything-into-RAM structures. That combination —
-not FAISS alone — drives the bulk of the surrounding complexity and is what fails
-on large installs:
-
-1. **Deletes are fake.** On update/remove, `remove_document_docstore_nodes`
-   (`indexing.py:182`) deletes nodes from the _docstore_ only; the FAISS vectors
-   physically remain forever. The only way to truly reclaim them is a full
-   `rebuild=True` (re-embed every document).
-2. **No metadata filtering** forces the entire custom `DocumentFilteredFaissRetriever`
-   (`chat.py:78-151`) with its expanding `top_k *= 2` loop to emulate a
-   `document_id IN (...)` filter.
-3. **Whole-docstore Python scans.** `query_similar_documents` (`indexing.py:419`)
-   iterates the full docstore in Python to translate `document_id → node_id`.
-4. **Write amplification.** Every single-document add/update/remove takes a global
-   `FileLock` and calls `storage_context.persist()`, which rewrites the entire
-   multi-GB JSON docstore — O(N) memory and O(N) disk per document operation.
-5. **Brute-force query.** `IndexFlatL2` is O(N·d) per search with no ANN.
-
-We cannot predict or bound a user's install size, so the replacement must scale from
-a handful of documents to very large corpora on a single node, with no extra service.
-
-## Constraints (decided during brainstorming)
-
- **Engine-agnostic, on-disk store.** Paperless supports SQLite, PostgreSQL _and_
-  MariaDB, so DB-integrated vectors (e.g. pgvector) are out — the vector store stays
-  a self-contained on-disk artifact like today's `llm_index` dir, identical across DB
-  backends.
- **Swap the storage layer only.** Keep llama-index as the framework. `VectorStoreIndex`,
-  the retrievers, the chat query engine + response synthesizer, `SimpleNodeParser`, and
-  the embedding-model abstraction are all unchanged. Only the `StorageContext` trio is
-  replaced.
- **Store: LanceDB**, integrated via a **custom `BasePydanticVectorStore` adapter** we
-  own (`PaperlessLanceVectorStore`) talking to `lancedb` + `pyarrow` directly — _not_ the
-  official `llama-index-vector-stores-lancedb` wrapper. The wrapper was evaluated and
-  rejected: it hard-requires `pandas`, hides `index_type` behind `**kwargs`, and _raises_
-  on empty query results. A ~150-180 line adapter against llama-index's stable public
-  interfaces avoids all three and lets us own the table schema. (See "Why a custom
-  adapter".)
- **ANN: auto threshold.** Small installs use LanceDB's exact (brute-force) kNN, which
-  LanceDB's own docs call sufficient for datasets up to ~100K vectors. Past a threshold
-  we build an IVF index automatically, best-effort, with exact search as the
-  always-valid fallback.
- **pandas is eliminated.** `llama-index-core` does not depend on pandas, and the custom
-  adapter materializes LanceDB results via `pyarrow` (`.to_list()`), so pandas never
-  enters the dependency tree. `pyarrow` is a direct dep but arrives transitively through
-  `lancedb` regardless.
-
-## Why LanceDB
-
-LanceDB is the only embedded, serverless candidate architected for **disk-resident,
-memory-mapped** operation — RAM does not scale with the corpus, which is the single
-most important property for "tiny or very large, equally." It provides real CRUD
-(predicate `delete`, `add`), filtered search, and IVF ANN, all writing to a directory on
-disk. Because our adapter declares `stores_text = True`, llama-index runs off the vector
-store alone — so both `SimpleDocumentStore` and `SimpleIndexStore` are deleted outright.
-
-Verified against `lancedb 0.33.0` with functional probes:
-
- A `lancedb` table on disk is memory-mapped; writes are durable on call (connect = a
-  directory, table = a Lance dataset). **No `persist()` and no whole-file rewrite.**
- `table.delete('doc_id = "..."')` is a real predicate delete that physically removes
-  rows (probe: a 2-chunk doc dropped to 0 rows).
- `table.add(rows)` appends; `merge_insert(...).when_not_matched_by_source_delete(...)`
-  provides an atomic upsert that also prunes stale chunks — the incremental update path
-  (see §3). Verified: a doc going 5→3 chunks ends with exactly the 3 new chunks.
- `table.search(embedding).where('document_id IN (...)').limit(k).to_list()` returns
-  plain dicts via `pyarrow` (**no pandas**), and returns `[]` cleanly on no match — **no
-  raise**.
-
-## Why a custom adapter (not `llama-index-vector-stores-lancedb`)
-
-The custom adapter was proven end-to-end through llama-index's real
-`VectorStoreIndex` → `VectorIndexRetriever` path with a `MockEmbedding`: build, update
-(delete+insert with **zero orphan rows**), `MetadataFilters(IN)` forwarded through the
-retriever, empty-filter → `[]`, and remove — all with **`pandas` never imported**. The
-adapter is ~120 lines in the probe (≈150-180 production-ready) and uses only llama-index's
-**stable public** primitives: `BasePydanticVectorStore`, `node_to_metadata_dict` /
-`metadata_dict_to_node`, `VectorStoreQuery` / `VectorStoreQueryResult`.
-
-Choosing the adapter over the wrapper converts several wrapper-specific liabilities into
-non-issues:
-
-| Wrapper liability                                                                       | With the custom adapter                                                                            |
-| --------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------- |
-| Hard-imports `pandas` (`base.py:33`), uses `.to_pandas()`                               | Eliminated — `pyarrow.to_list()`                                                                   |
-| `create_index` hides `index_type`, hard-defaults `num_sub_vectors=96` (base.py:333-368) | We call `table.create_index(...)` with explicit `index_type` / partitions / sub-vectors            |
-| `query()` _raises_ `Warning` on empty results (base.py:560-563)                         | Our `query()` returns an empty `VectorStoreQueryResult` natively                                   |
-| `_to_lance_filter` prefixes `metadata.<key>`; fragile when `_metadata_keys is None`     | Dedicated top-level `document_id` column; filter is plain `document_id IN (...)`, scalar-indexable |
-| Third-party package to pin and track for API drift                                      | No integration package; depend only on stable llama-index core interfaces                          |
-
-The cost is ~150-180 lines we own and test (vs. a ~10-line subclass) — but we were
-already subclassing to swallow the empty-result `Warning` and add the ANN threshold, so
-the net additional code is modest and removes a dependency.
-
-## Design
-
-### 1. Storage layer
-
-Replace `get_or_create_storage_context()` with a vector-store factory that returns a
-`PaperlessLanceVectorStore` pointed at `settings.LLM_INDEX_DIR` with an **explicit,
-pinned `table_name`** (e.g. `LLM_INDEX_TABLE = "documents"`) used consistently by the
-factory, the existence check (§7), and the migration detection (§8). The index is built
-with `VectorStoreIndex.from_vector_store(vector_store, embed_model=...)` for the
-load/query path, and `VectorStoreIndex(nodes=..., storage_context=...)` (storage context
-holding only the vector store) for the rebuild path. No docstore, no index store.
-
-`meta.json` (embedding model name + dimension) is **kept** for embedding-model-change
-detection that forces a rebuild — unchanged from today (`embedding.py:get_embedding_dim`).
-
-### 2. `PaperlessLanceVectorStore(BasePydanticVectorStore)`
-
-A custom adapter (~150-180 lines) implementing llama-index's vector-store contract
-directly against `lancedb` + `pyarrow`. Class flags: `stores_text = True`,
-`flat_metadata = True`.
-
-**Table schema** (explicit `pyarrow` schema, created lazily on first `add`):
-
-| Column         | Type                            | Purpose                                                           |
-| -------------- | ------------------------------- | ----------------------------------------------------------------- |
-| `id`           | `string`                        | node id (`node.node_id`)                                          |
-| `doc_id`       | `string`                        | `node.ref_doc_id` (= `str(document.id)`, see §3) — the delete key |
-| `document_id`  | `string`                        | top-level filter column (mirrors `metadata["document_id"]`)       |
-| `vector`       | `fixed_size_list<float32>[dim]` | embedding                                                         |
-| `node_content` | `string`                        | `json.dumps(node_to_metadata_dict(node, remove_text=False))`      |
-
-A dedicated top-level `document_id` column (rather than the wrapper's nested
-`metadata.<key>` struct) makes filtering a plain `document_id IN (...)` predicate and
-allows an optional LanceDB **scalar index** on it for fast filtered scans.
-
-**Methods:**
-
- `add(nodes)` — serialize each node via `node_to_metadata_dict(node, remove_text=False,
-flat_metadata=True)` into the schema above; lazily `create_table` (with the explicit
-  schema sized to the embedding dim) or `table.add(rows)` (plain append). Used by the
-  **rebuild** path (bulk insert into a fresh table) and as llama-index's `add` hook.
-  Returns node ids.
- `upsert_document(document_id, nodes)` — the **incremental** add/update path. A single
-  `merge_insert("id").when_matched_update_all().when_not_matched_insert_all()
-.when_not_matched_by_source_delete("document_id = '<id>'").execute(rows)` — atomic
-  replace-with-prune for one document (see §3). All nodes passed must belong to the one
-  `document_id`. Nodes are embedded before the call (the incremental path embeds with the
-  configured `embed_model` rather than going through `index.insert_nodes`).
- `delete(ref_doc_id)` — `table.delete(f'doc_id = "{ref_doc_id}"')` (parameter-escaped).
-  Used for document removal.
- `delete_nodes(node_ids)` — `table.delete('id IN (...)')` (for completeness).
- `get_nodes(node_ids=None, filters=None)` — `table.search().where(...).to_list()`,
-  rebuild nodes via `metadata_dict_to_node(json.loads(row["node_content"]))`. Returns `[]`
-  cleanly when empty — the correct primitive for the chat no-content pre-check.
- `query(VectorStoreQuery)` — `table.search(query.query_embedding).where(_build_where(
-query.filters)).limit(query.similarity_top_k).to_list()`; rebuild nodes, map LanceDB L2
-  `_distance` → a similarity score, and **return an empty `VectorStoreQueryResult` on no
-  match (no raise)**.
- `client` property → the `lancedb` connection.
-
-**Filter translation** — `_build_where(MetadataFilters)` handles exactly the operators we
-use (`EQ`, `IN`) on the top-level `document_id` column, string-escaping values. This is
-small, fully owned, and free of the wrapper's `metadata.`-prefix / `_metadata_keys`
-behavior.
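
The filter translation described above can be sketched in plain Python. This is a minimal illustration, not the adapter's actual code: `Op` and `Filter` are hypothetical stand-ins for llama-index's `FilterOperator` and `MetadataFilter` so the block runs without llama-index installed, and the single-quote escaping and the `false` predicate for an empty `IN` list are assumptions about what the LanceDB SQL dialect accepts.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Sequence, Union


class Op(Enum):
    """Hypothetical stand-in for llama-index's FilterOperator (EQ and IN only)."""

    EQ = "=="
    IN = "in"


@dataclass
class Filter:
    """Hypothetical stand-in for llama-index's MetadataFilter."""

    key: str
    value: Union[str, Sequence[str]]
    operator: Op = Op.EQ


def _quote(value: object) -> str:
    # SQL-style string literal: wrap in single quotes, double embedded quotes.
    return "'" + str(value).replace("'", "''") + "'"


def build_where(filters: Optional[Sequence[Filter]]) -> Optional[str]:
    """Translate filters into a predicate string; None means unconstrained."""
    if not filters:
        return None
    clauses: List[str] = []
    for f in filters:
        if f.key != "document_id":
            raise ValueError(f"unsupported filter key: {f.key}")
        if f.operator is Op.EQ:
            clauses.append(f"document_id = {_quote(f.value)}")
        elif f.operator is Op.IN:
            values = [f.value] if isinstance(f.value, str) else list(f.value)
            if not values:
                # Assumption: an always-false predicate yields an empty result.
                clauses.append("false")
            else:
                quoted = ", ".join(_quote(v) for v in values)
                clauses.append(f"document_id IN ({quoted})")
        else:
            raise ValueError(f"unsupported operator: {f.operator}")
    return " AND ".join(clauses)
```

Rejecting unknown keys and operators outright, rather than ignoring them, keeps a future caller from silently getting unfiltered results.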
-
-**Auto ANN index** — `maybe_create_ann_index()`, called after build/update writes **while
-holding the global write lock** (it is itself a write path): if the table row count
-exceeds `ANN_INDEX_MIN_ROWS` (~100K chunks, per LanceDB guidance) and no vector index
-exists yet, best-effort `table.create_index(...)`:
-
- **Index type by divisibility.** IVF_PQ requires `num_sub_vectors` to _evenly divide_
-  the embedding dimension — LanceDB raises a hard `RuntimeError` otherwise (verified). The
-  dimension is detected at runtime from a user-configurable model and many common dims
-  (e.g. 1024) are **not** divisible by 96. So: pick a `num_sub_vectors` that divides the
-  dim and build **IVF_PQ**; if none exists, build **IVF_FLAT** (`index_type="IVF_FLAT"`),
-  which has no divisor constraint and still gives IVF/ANN speedup — strictly better than
-  reverting to full brute-force. (Talking to `lancedb` directly, `index_type` is just a
-  named argument — none of the wrapper's kwargs-smuggling.)
- `num_partitions`: LanceDB guidance is ≈ `num_rows / 4096`; clamp to a sane minimum.
- Wrapped in `try/except` — a failure logs and leaves the table on exact search, which is
-  always correct.
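
The parameter selection in the bullets above might look like the following sketch. The candidate list for `num_sub_vectors`, the function names, and the shape of the returned plan are all illustrative assumptions; only the divisibility rule and the `num_rows / 4096` guidance come from the text. The actual `table.create_index(...)` call and its `try/except` wrapper are deliberately left out.

```python
from typing import Dict, Optional, Tuple, Union

# Assumed candidate values, tried largest first; not taken from LanceDB docs.
SUB_VECTOR_CANDIDATES: Tuple[int, ...] = (96, 64, 48, 32, 16, 8)


def pick_num_sub_vectors(dim: int) -> Optional[int]:
    """Return the first candidate that evenly divides dim, or None."""
    for candidate in SUB_VECTOR_CANDIDATES:
        if dim % candidate == 0:
            return candidate
    return None


def pick_num_partitions(num_rows: int, minimum: int = 1) -> int:
    """Roughly num_rows / 4096, clamped to a minimum."""
    return max(minimum, num_rows // 4096)


def plan_ann_index(dim: int, num_rows: int) -> Dict[str, Union[str, int]]:
    """Choose IVF_PQ when a valid divisor exists, IVF_FLAT otherwise."""
    plan: Dict[str, Union[str, int]] = {
        "num_partitions": pick_num_partitions(num_rows),
    }
    num_sub_vectors = pick_num_sub_vectors(dim)
    if num_sub_vectors is None:
        plan["index_type"] = "IVF_FLAT"  # no divisor constraint
    else:
        plan["index_type"] = "IVF_PQ"
        plan["num_sub_vectors"] = num_sub_vectors
    return plan
```

With this candidate list a 1024-dim model still gets IVF_PQ (64 divides 1024), and only a dimension that no candidate divides falls back to IVF_FLAT.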
-
-### 3. Node identity
-
-In `build_document_node` (`indexing.py:109`), set the `LlamaDocument` `id_` to
-`str(document.id)`. `SimpleNodeParser` propagates that as each chunk node's
-`ref_doc_id`, and the adapter stores it in the `doc_id` column. Result: every chunk of a
-paperless document shares `ref_doc_id == str(document.id)`, so one `delete(str(doc.id))`
-clears exactly that document's chunks (verified end-to-end). `document_id` also remains in
-node metadata (and is mirrored to the top-level filter column) for filtering and result
-mapping.
-
-**Update = native upsert via `merge_insert` (one atomic commit).** The incremental
-add/update path uses a single `merge_insert`, not delete-then-add:
-
-```
-table.merge_insert("id")
-     .when_matched_update_all()
-     .when_not_matched_insert_all()
-     .when_not_matched_by_source_delete(f"document_id = '{document_id}'")
-     .execute(new_rows)
-```
-
-The `when_not_matched_by_source_delete` clause — scoped to the document's `document_id`
-— prunes stale trailing chunks (the case where an edit reduces a document's chunk count)
-**atomically in the same commit**. Verified on 0.33.0: a doc going 5→3 chunks ends with
-exactly the 3 new chunks, other documents untouched, and it works whether or not chunk
-ids are deterministic (non-matching ids become a full replace).
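
To make the three clauses concrete, here is a pure-Python model of what the upsert does to the table. It is only a simulation of the semantics described above, using a dict keyed by `id` in place of a Lance table; the function name `upsert_with_prune` is hypothetical and nothing here calls `lancedb`.

```python
from typing import Dict, List

Row = Dict[str, str]


def upsert_with_prune(
    table: Dict[str, Row], document_id: str, new_rows: List[Row]
) -> Dict[str, Row]:
    """Return the table state after one upsert; the input is not mutated."""
    incoming = {row["id"]: row for row in new_rows}
    result: Dict[str, Row] = {}
    for row_id, row in table.items():
        if row_id in incoming:
            # Matched on id: replaced below (when_matched_update_all).
            continue
        if row["document_id"] == document_id:
            # In scope but absent from the source: pruned
            # (when_not_matched_by_source_delete).
            continue
        # Rows of other documents are left untouched.
        result[row_id] = row
    # Matched updates plus new inserts (when_not_matched_insert_all).
    result.update(incoming)
    return result
```

Feeding it a document that shrinks from five chunks to three leaves exactly the three new rows for that document, and passing rows with entirely new ids degenerates into a full replace, which mirrors the two cases verified above.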
-
-This is strictly better than delete-then-add on three axes:
-
- **Atomicity / no transient empty state.** Queries take no lock (§6), so delete-then-add
-  exposes a window between the delete commit and the add commit in which a concurrent
-  reader sees the document with **zero chunks**. A single `merge_insert` commit eliminates
-  that window — a reader sees either the old or the new chunk set.
- **Half the version growth.** One commit per update instead of two, directly halving the
-  MVCC version accumulation that compaction (§10) must reclaim.
- **Correctness preserved** without a separate delete call.
-
-> **Important:** `optimize()` prunes old _versions_, **not** dead _rows_ in the live
-> version. A plain upsert (update+insert without the delete clause) would leave stale
-> chunks as live rows that `optimize` can never remove — so the
-> `when_not_matched_by_source_delete` clause is mandatory, not optional.
-
-> **Index caveat (LanceDB #3177):** `merge_insert` can fail _silently_ after `optimize()`
-> when a scalar index exists on the **match column**. We match on `id`, so a scalar index
-> must **never** be created on `id`. The optional scalar index for filtering goes on
-> `document_id` only (§2), which is not the match column.
-
-### 4. The four operations collapse
-
-| Operation    | Before                                                                                                      | After                                                                                   |
-| ------------ | ----------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------- |
-| add / update | load whole index → `remove_document_docstore_nodes` (fake delete) → `insert_nodes` → `persist` (rewrite GB) | `store.upsert_document(str(doc.id), embedded_nodes)` (one atomic `merge_insert` commit) |
-| remove       | load → docstore delete → persist                                                                            | `store.delete(str(doc.id))`                                                             |
-| similar      | load whole docstore, Python scan for node ids, custom retriever                                             | `VectorIndexRetriever(index, similarity_top_k=k, filters=document_id IN allowed)`       |
-| chat         | custom `DocumentFilteredFaissRetriever` (74 lines)                                                          | stock `VectorIndexRetriever(filters=document_id IN doc_ids)`                            |
-
-Deleted entirely: `remove_document_docstore_nodes`, the whole-docstore scan in
-`query_similar_documents`, and `_get_document_filtered_retriever` /
-`DocumentFilteredFaissRetriever` in `chat.py`. `chat.py`'s direct reaches into
-`index.docstore.docs`, `index.vector_store._faiss_index`, and
-`index.index_struct.nodes_dict` all disappear. The "no content" pre-check in
-`_stream_chat_with_documents` becomes a `store.get_nodes(filters=...)` existence check —
-the adapter's `get_nodes` returns `[]` cleanly on no match, so it is the correct
-primitive for an existence test. References are still derived from returned nodes'
-`metadata["document_id"]` / `metadata["title"]`, so `_get_document_references` is
-unchanged.
-
-Because the adapter's `query()` returns an empty result on no match (it never raises),
-both similar-docs and chat — which retrieve through `VectorIndexRetriever` /
-`RetrieverQueryEngine` calling `vector_store.query()` internally — get a clean empty
-result set instead of an exception. This was the wrapper's most disruptive wart and is
-designed out, not worked around.
-
-### 5. Filtering
-
-Both similar-document and chat retrieval pass a `MetadataFilters` with a single
-`MetadataFilter(key="document_id", operator=FilterOperator.IN, value=[...])` (omitted
-when unconstrained). The adapter's `_build_where` translates this to the plain predicate
-`document_id IN ("...","...")` against the top-level `document_id` column — no struct
-path, no `_metadata_keys` dependence, so filtering is unconditionally correct on a freshly
-opened table and across process restarts (proven by the fresh-process probe).
-
-This replaces today's `query_similar_documents` mechanism, which pre-scans the docstore
-for node ids and passes `doc_ids=` to `VectorIndexRetriever` (`indexing.py:416-434`) — a
-_different_ retriever mechanism. The new path relies on
-`VectorIndexRetriever(filters=MetadataFilters(...))` forwarding `query.filters` into
-`vector_store.query()` — **verified end-to-end in the probe** (a retriever with an `IN`
-filter returned only the matching document). Still covered by a regression test (see
-Testing) since it is load-bearing for both similar-docs and chat.
-
-### 6. Concurrency
-
-Keep the existing `FileLock(_index_lock_path())` around **writes** only. Each write is now
-a small delta append/upsert instead of a multi-GB rewrite, so the lock is held briefly.
-Queries take no lock (LanceDB reads are MVCC snapshot-consistent). The lock-free read path
-is safe for updates specifically because the incremental update is a single atomic
-`merge_insert` commit (§3) — a reader never observes a document mid-update. The lock still
-serializes _writers_ across Celery processes to avoid `CommitConflictError`.
-
-**Why the global lock is load-bearing.** LanceDB's MVCC tolerates concurrent _appends_,
-but concurrent _delete/update_ operations frequently conflict and fail with
-`CommitConflictError` after exhausting retries (LanceDB issues #1597, #3086). Paperless's
-add/update path (a `merge_insert` with a delete clause, §3) falls in that category and runs from **separate Celery worker
-processes**. The design is safe only because `_index_lock_path()` is a single shared lock
-file under `LLM_INDEX_DIR` that serializes _all_ writers. This lock must:
-
- remain a single global lock (do **not** relax to per-document granularity), and
- cover every write path — add, update, remove, **and** `maybe_create_ann_index()`.
-
-### 7. Index existence / rebuild trigger
-
-Replace `vector_store_file_exists()` with a check for the LanceDB table's existence
-(`LLM_INDEX_DIR` present and the pinned `LLM_INDEX_TABLE` in `connection.table_names()`).
-The existing `queue_llm_index_update_if_needed` / `load_or_build_index` rebuild-on-missing
-logic is otherwise unchanged.
-
-**Dimension-mismatch guard.** The Lance table's vector column dimension is fixed at the
-first `add()`. Beyond the `meta.json` model-change detection (which forces a rebuild when
-the _model name_ changes), guard against a dimension mismatch directly: if the current
-embedding dim differs from the existing table's vector dim, force a rebuild rather than
-letting `add()` fail with a hard dimension error. This covers the gaps `meta.json` can't —
-a missing/corrupt `meta.json`, or two models sharing a name but differing in dim.
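
The combined rebuild decision could be expressed as a small predicate. This is a sketch under stated assumptions: the function name `needs_rebuild` and the `meta.json` key `embedding_model` are hypothetical, and a missing table is treated as "rebuild" to match the rebuild-on-missing logic above.

```python
from typing import Mapping, Optional


def needs_rebuild(
    current_model: str,
    current_dim: int,
    meta: Optional[Mapping[str, object]],
    table_dim: Optional[int],
) -> bool:
    """Decide whether the index must be rebuilt before any write."""
    if table_dim is None:
        # No table on disk yet: the rebuild-on-missing path applies.
        return True
    if meta is not None:
        stored_model = meta.get("embedding_model")  # assumed key name
        if stored_model is not None and stored_model != current_model:
            # meta.json records a different embedding model.
            return True
    # Covers a missing or corrupt meta.json, or a same-name model
    # whose dimension differs from the stored vector column.
    return table_dim != current_dim
```

Checking the table's own vector dimension last means the guard still fires when `meta.json` is absent or unreadable.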
-
-### 8. Migration
-
-The index is fully derived data, rebuildable from `Document` rows. On first run of the
-new code, detect the stale FAISS format (presence of `default__vector_store.json` /
-faiss files with no LanceDB table), wipe `LLM_INDEX_DIR`, and trigger a rebuild through
-the existing `queue_llm_index_update_if_needed(rebuild=...)` path. No data migration and
-no user action beyond the automatic background rebuild.
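
A stale-format check along these lines might be enough for the detection step. It is a sketch, not the migration code: the function name is hypothetical, only `default__vector_store.json` is checked (other FAISS file names are not given above), and it assumes LanceDB stores a table as a `<table_name>.lance` directory under the connection directory.

```python
import tempfile
from pathlib import Path


def has_stale_faiss_index(index_dir: Path, table_name: str = "documents") -> bool:
    """True when old FAISS-era files exist and no LanceDB table does."""
    if not index_dir.is_dir():
        return False
    # Assumption: a LanceDB table lives in "<table_name>.lance".
    if (index_dir / f"{table_name}.lance").exists():
        return False
    return (index_dir / "default__vector_store.json").exists()


# Small demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "default__vector_store.json").write_text("{}")
    stale_before = has_stale_faiss_index(demo_dir)
    (demo_dir / "documents.lance").mkdir()
    stale_after = has_stale_faiss_index(demo_dir)
```

In the demonstration, the directory counts as stale until a `documents.lance` table directory appears, after which the check stops reporting it.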
-
-### 9. Dependencies (`pyproject.toml`)
-
- **Remove:** `faiss-cpu`, `llama-index-vector-stores-faiss`.
- **Add:** `lancedb` (pulls in `pyarrow`, `numpy`, `pydantic`, `tqdm`) and `pyarrow`
-  (declared directly since the adapter imports it, even though `lancedb` pulls it
-  transitively). **No `llama-index-vector-stores-lancedb`, no `pandas`** — `llama-index-core`
-  does not require pandas (verified) and the adapter uses `pyarrow.to_list()`.
- Confirm multi-arch wheels (linux x86_64 + aarch64, the paperless Docker targets) for
-  `lancedb`/`pyarrow` resolve in the lockfile. (`lancedb 0.33.0` ships manylinux x86_64 +
-  aarch64 wheels, matching the paperless Docker build matrix.)
-
-### 10. Maintenance / compaction — **required, not optional**
-
-MVCC has a real disk cost that this design must actively manage. LanceDB writes a **new
-fragment + version on every `add`/`delete`** and retains the superseded files until
-cleanup. Paperless adds/updates documents **one at a time**, so the store bloats
-continuously without maintenance. Measured on 2000 × 768-dim vectors (raw float32 =
-6000 KiB):
-
-| Scenario                                                | On disk                | Versions |
-| ------------------------------------------------------- | ---------------------- | -------- |
-| One bulk insert (= a rebuild)                           | 6016 KiB               | 1        |
-| 2000 single-row adds (= per-document writes)            | **172,848 KiB (~28×)** | 2001     |
-| After `table.optimize(cleanup_older_than=timedelta(0))` | **6344 KiB**           | 1        |
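
The figures in the table can be checked with a few lines of arithmetic. The helper names are illustrative; the inputs are the measured values quoted above.

```python
def raw_vector_kib(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of the raw float32 vectors alone, in KiB."""
    return num_vectors * dim * bytes_per_value / 1024


def bloat_factor(on_disk_kib: float, baseline_kib: float) -> float:
    """How many times larger the store is than a compact baseline."""
    return on_disk_kib / baseline_kib


raw = raw_vector_kib(2000, 768)           # raw float32 payload
per_row = bloat_factor(172_848, 6016)     # single-row adds vs. bulk insert
compacted = bloat_factor(6344, 6016)      # after optimize vs. bulk insert
```

The raw payload works out to 6000 KiB, the per-row write pattern to a little under 29 times the bulk-insert size, and the compacted table to within about 6% of it.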
-
-Implications:
-
- **Full rebuilds are naturally compact** (bulk insert ≈ raw vector bytes), so a rebuild
-  resets accumulated bloat.
- **The atomic upsert (§3) halves _update_ version growth** (one commit instead of
-  delete-then-add's two), but every new-document insert is still its own version, so
-  versions accumulate over time regardless — compaction remains required.
- **Per-document writes must be compacted periodically.** Run
-  `table.optimize(cleanup_older_than=<retention>)` — a **single call** that compacts
-  fragments _and_ drops old versions — folded into the existing scheduled LLM-index
-  maintenance task, under the global write lock. Use a small but non-zero retention in
-  production (e.g. minutes–hours) so an in-flight reader on an old version isn't pulled
-  out from under; `timedelta(0)` is for tests/rebuild-time only.
- **Do not use the older `cleanup_old_versions()`** API: it requires the separate
-  `pylance` package (not pulled by `lancedb` core) and is superseded by
-  `optimize(cleanup_older_than=...)`.
-
-**On the "larger on disk than FAISS" observation:** at small scale LanceDB stores vectors
-as **raw `float32`** (identical per-vector bytes to FAISS `IndexFlatL2`); vector
-_compression_ only comes from the IVF_PQ index, which only exists past the ANN threshold
-(§2). So a small dataset is expected to be _comparable to, not smaller than_, FAISS, and any
-large discrepancy is version accumulation, fixed by the compaction above.
-
-> **Windows note:** the probe hit an `Access is denied` error writing a version-hint file
-> during cleanup on Windows (temp-dir file locking). Paperless production is Linux
-> containers, so this does not affect the deployment target, but bare-metal Windows dev
-> installs may need attention.
-
-## Testing
-
-Per project conventions (pytest-style, classes with `@pytest.mark.django_db`,
-pytest-mock, factory-boy, type-annotated fixtures/tests, default config). LanceDB writes
-to a real directory, so tests point `settings.LLM_INDEX_DIR` at `tmp_path` and exercise a
-**real** (tiny) LanceDB table with a stub embedding model returning deterministic vectors
-— no mocking of store internals.
-
- **add → query** returns the document.
- **update** via `upsert_document` leaves no orphan rows — re-index a document whose chunk
-  count _shrinks_ (e.g. 5→3) and assert exactly the new chunks remain and other documents
-  are untouched (this is the guarantee the old fake delete could not provide, and proves
-  `when_not_matched_by_source_delete` prunes stale chunks).
- **update is one commit** — assert the table version advances by exactly 1 per
-  `upsert_document` (guards the atomicity / version-growth property).
- **remove** drops all of a document's chunks.
- **filtered query** scopes results to the given `document_id`s and excludes others.
- **empty query** returns `[]` (the adapter's `query()` never raises).
- **node round-trip**: a node serialized via `node_to_metadata_dict` and reconstructed via
-  `metadata_dict_to_node` preserves text + metadata (`document_id`, `title`).
- **embedding-model change** → `meta.json` mismatch forces rebuild (existing behavior).
- **dimension-mismatch guard** → a current embedding dim differing from the stored table
-  dim forces a rebuild rather than a hard `add()` failure.
- **ANN threshold** trigger logic with a low test threshold: `maybe_create_ann_index`
-  attempts an index past the threshold and is a no-op below it; a `create_index` failure
-  is non-fatal and leaves exact search working.
- **ANN fallback on a non-divisible dim**: with an embedding dim not divisible by the PQ
-  `num_sub_vectors` (e.g. 1024), `maybe_create_ann_index` builds IVF_FLAT (or the
-  try/except fallback fires) and leaves the table queryable, not broken/unindexed.
- **Fresh-process filtering**: construct a brand-new `PaperlessLanceVectorStore` against an
-  existing on-disk table and assert an `IN` filter still returns the right rows — the
-  cross-restart path.
- **Retriever forwards filters**: assert `VectorIndexRetriever(filters=MetadataFilters(...))`
-  built on `VectorStoreIndex.from_vector_store(...)` actually scopes results — the
-  load-bearing integration seam for similar-docs and chat.
- **Compaction reclaims versions**: after several single-document writes, the maintenance
-  `optimize(cleanup_older_than=...)` call reduces the table to a single version and
-  results stay queryable afterward.
- **Upsert after optimize** (LanceDB #3177 guard): with a scalar index on `document_id`
-  (and none on `id`), an `upsert_document` performed _after_ `optimize()` still prunes and
-  replaces correctly — verified, but pinned with a test so a future index-placement change
-  or LanceDB regression is caught.
- Parametrize the add/update/remove variations rather than duplicating bodies.
-
-## Out of scope
-
- Replacing llama-index for chunking, embeddings, or the chat query engine.
- Any DB-integrated (pgvector-style) path.
- Hybrid / full-text / reranked search modes offered by LanceDB (vector search only,
-  matching current behavior).
- Tuning embedding models or chunking parameters.
-
-## Open risks
-
- **pyarrow/lancedb footprint.** `lancedb` + `pyarrow` (native wheels) enlarge the optional
-  AI feature's dependency tree; verify image-size impact when updating the lockfile. (Still
-  lighter than the wrapper path, which added `pandas` on top of these.)
- **ANN index parameters.** The IVF_PQ-vs-IVF_FLAT-by-divisibility logic (§2) plus the
-  best-effort/exact fallback contains the correctness risk, but the row threshold and
-  `num_partitions` heuristic should be validated on a large fixture for actual query
-  latency.
- **We own the adapter.** We depend on llama-index's `BasePydanticVectorStore` interface
-  and the `node_to_metadata_dict` / `metadata_dict_to_node` helpers. These are stable core
-  APIs (far more stable than the integration package), but a major llama-index bump should
-  re-run the end-to-end retriever test. Pin a known-good `lancedb` and `llama-index-core`.
- **`merge_insert` + scalar index on the match column (LanceDB #3177).** `merge_insert` can
-  fail _silently_ after `optimize()` if a scalar index exists on the match column. We match
-  on `id` and only index `document_id`, so we are clear — but this is an invariant to
-  enforce (never index `id`) and to cover with a test that exercises
-  upsert-after-optimize.