Marks some things as done

2026-06-30 17:24:22 +00:00 · 2026-06-12 11:38:20 -07:00
parent b2151acfd5
commit 85cd9b657b
6 changed files with 0 additions and 0 deletions
@@ -1,115 +0,0 @@
-# LanceDB Node Metadata Enrichment
-
-**Status:** Design
-**Date:** 2026-06-09
-**Branch target:** `dev`
-**Prerequisite for:** AI taxonomy hints (`2026-05-20-ai-taxonomy-hints-design.md`)
-**Depends on:** `feature-lancedb-schema-migrate`
-
-## Problem
-
-`build_llm_index_text` currently includes three short structured values in the embedding text:
-
-```python
-lines = [
-    f"Filename: {doc.filename}",
-    f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
-    f"Archive Serial Number: {doc.archive_serial_number or ''}",
-    ...
-]
-```
-
-These don't belong in the embedding. The embedding should capture semantic content — the meaning of the document — not structured identifiers. Including them means vectors are partly "polluted" with filing metadata, making similarity search less accurate. The existing TODO in `embedding.py:115` explicitly calls this out.
-
-The right home for structured values is `node.metadata` (excluded from the embedding, but surfaced to the LLM when nodes are retrieved as context). `title`, `tags`, `correspondent`, and `document_type` already follow this pattern.
-
-Notes and custom fields stay in the embedding text — Notes is long free text, custom fields are dynamic and their semantic content belongs in the vector.
-
-## Changes
-
-### `paperless_ai/embedding.py` — `build_llm_index_text`
-
-Remove the three lines and the TODO comment:
-
-```python
-# remove:
-f"Filename: {doc.filename}",
-f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
-f"Archive Serial Number: {doc.archive_serial_number or ''}",
-```
-
-`Notes` and `Custom Fields` lines remain.
-
-### `paperless_ai/indexing.py` — `build_document_node`
-
-Add the three fields to the metadata dict:
-
-```python
-metadata = {
-    "document_id": str(document.id),
-    "title": document.title,
-    "filename": document.filename or "",
-    "storage_path": document.storage_path.name if document.storage_path else None,
-    "archive_serial_number": document.archive_serial_number,
-    "tags": [t.name for t in document.tags.all()],
-    "correspondent": document.correspondent.name if document.correspondent else None,
-    "document_type": document.document_type.name if document.document_type else None,
-    "created": document.created.isoformat() if document.created else None,
-    "added": document.added.isoformat() if document.added else None,
-    "modified": document.modified.isoformat(),
-}
-```
-
-All three new keys must also appear in `excluded_embed_metadata_keys` (consistent with all existing keys — none of the metadata is included in the embedding text).
-
-### `paperless_ai/vector_store.py` — schema migration
-
-Register migration version 2 on the `feature-lancedb-schema-migrate` framework. The embedding text changes, so all existing vectors are stale — a full rebuild is required. The migration's `apply` is a no-op; the rebuild handles regenerating all nodes with the correct metadata.
-
-```python
-MIGRATIONS: list[Migration] = [
-    Migration(
-        version=2,
-        description="move filename/storage_path/asn from embedding text to metadata",
-        requires_reembed=True,
-        apply=lambda table: None,
-    ),
-]
-CURRENT_SCHEMA_VERSION: Final[int] = 2
-```
-
-On next `update_llm_index` run, `requires_reembed_migration()` returns `True`, triggering a full drop-and-rebuild. All new nodes carry the three metadata fields. No manual intervention required.
-
-## Impact
-
-- Similarity search quality improves slightly: vectors are more purely semantic.
-- The LLM receives `filename`, `storage_path`, and `archive_serial_number` as structured metadata alongside retrieved chunks, rather than embedded in the chunk text. Same information, cleaner separation.
-- One forced index rebuild on upgrade (beta: acceptable).
-- `node.metadata["storage_path"]`, `node.metadata["filename"]`, `node.metadata["archive_serial_number"]` are available on all retrieved nodes after rebuild, which unblocks the taxonomy hints feature.
-
-## Testing
-
-All tests use pytest style — grouped under classes, `@pytest.mark.django_db` on the class, `pytest-mock`'s `mocker` fixture, every fixture and test signature type-annotated. Format with `ruff` directly.
-
-### `paperless_ai/tests/test_embedding.py` (modify)
-
-- `class TestBuildLlmIndexText:`
-  - Assert `"Filename:"` is **not** in the output.
-  - Assert `"Storage Path:"` is **not** in the output.
-  - Assert `"Archive Serial Number:"` is **not** in the output.
-  - Assert Notes and Custom Fields lines are still present (regression guard).
-
-### `paperless_ai/tests/test_ai_indexing.py` (modify)
-
-- `class TestBuildDocumentNode:`
-  - `filename` is in `node.metadata` and in `excluded_embed_metadata_keys`.
-  - `storage_path` is in `node.metadata` (name string) and in `excluded_embed_metadata_keys`; `None` when document has no storage path.
-  - `archive_serial_number` is in `node.metadata` and in `excluded_embed_metadata_keys`; `None` when unset.
-  - None of the three appear in the embedding text produced for the node.
-
-### `paperless_ai/tests/test_vector_store.py` (modify)
-
-- `class TestSchemaMigrations:`
-  - `pending_migrations()` returns the v2 migration when stored version is 1.
-  - `requires_reembed_migration()` returns `True` when stored version is 1.
-  - `apply_structural_migrations()` stops at the v2 migration (skips reembed entries).
@@ -1,138 +0,0 @@
-# LLM Index Schema Migrations (second spec)
-
-Date: 2026-06-10
-Depends on: `docs/superpowers/specs/2026-06-10-sqlite-vec-vector-store-design.md` and its implementation plan (`docs/superpowers/plans/2026-06-10-sqlite-vec-transition.md`). This spec layers on top of the completed sqlite-vec transition; do not start it before that branch lands.
-Supersedes: PR #12968 (in-place LanceDB migrations). The machinery design there is carried over nearly verbatim; only the storage backend specifics change. #12968 should be closed with a pointer here once this ships.
-
-Scope update (user decision, 2026-06-10): the `embedding.py:115` metadata restructure originally drafted as Part 2 of this spec was folded into the transition plan instead (its Task 5), because the transition forces a full rebuild anyway, so the embedded-text change rides along with no extra re-embed cost. This spec is now machinery-only: it ships with an EMPTY migration registry, ready for whatever schema change comes next. Part 2 below is retained as the worked example of how a re-embed migration would be registered, since the next one will not have a free rebuild to piggyback on.
-
-## Part 1: Schema migration machinery (ported from PR #12968)
-
-### What carries over unchanged
-
-The PR's design survives the store swap intact and is adopted as-is:
-
-- `Migration` frozen dataclass: `version: int`, `description: str`, `requires_reembed: bool`, `apply: Callable` (compare/hash-excluded field).
-- `MIGRATIONS: list[Migration]` ordered registry + `CURRENT_SCHEMA_VERSION: Final[int]` in `vector_store.py`. To add a migration: bump the constant, append an entry.
-- Store surface: `stored_schema_version() -> int` (0 when unrecorded, so pre-versioning tables treat every migration as pending), `pending_migrations()`, `requires_reembed_migration()`, `apply_structural_migrations() -> list[Migration]`.
-- The stop-at-first-reembed-boundary rule in `apply_structural_migrations()`: structural migrations are applied in version order only up to the first pending `requires_reembed=True` entry, so the version counter can never jump past a re-embed boundary and silently skip the rebuild. (This was the subtle correctness insight of #12968; preserve the comment.)
-- The `update_llm_index()` hook, verbatim from the PR:
-
-```python
-    with write_store(embed_model_name=model_name) as store:
-        if not rebuild and store.table_exists():
-            store.apply_structural_migrations()
-            if store.requires_reembed_migration():
-                logger.warning(
-                    "Schema migration requires re-embedding; forcing LLM index rebuild.",
-                )
-                rebuild = True
-```
-
-- Test approach from the PR: mock `MIGRATIONS`/`CURRENT_SCHEMA_VERSION` with `mocker.patch`, spy on `drop_table` to distinguish in-place from rebuild, one test per path (structural applied without rebuild; pending re-embed forces rebuild).
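
The stop-at-boundary rule can be sketched in isolation. This is a minimal illustration, not the PR's code: the free function `apply_structural` and its `store` argument are hypothetical stand-ins for the store method, and the sketch leaves out persisting the version. The `Migration` fields follow the dataclass described above.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class Migration:
    version: int
    description: str
    requires_reembed: bool
    # Excluded from compare/hash, as the spec describes.
    apply: Callable[[Any], None] = field(compare=False, hash=False)


def apply_structural(
    stored_version: int,
    migrations: list[Migration],
    store: Any = None,
) -> tuple[list[Migration], int]:
    """Apply pending structural migrations up to the first re-embed boundary.

    Hypothetical helper: returns the applied migrations and the version
    that would be stamped afterwards.
    """
    applied: list[Migration] = []
    version = stored_version
    for migration in sorted(migrations, key=lambda m: m.version):
        if migration.version <= stored_version:
            continue  # already applied
        if migration.requires_reembed:
            break  # never step past a pending re-embed migration
        migration.apply(store)
        applied.append(migration)
        version = migration.version
    return applied, version


# Mixed registry: structural v2, re-embed v3, structural v4.
registry = [
    Migration(2, "structural", False, lambda store: None),
    Migration(3, "re-embed", True, lambda store: None),
    Migration(4, "structural", False, lambda store: None),
]
applied, version = apply_structural(1, registry)
```

With a stored version of 1, only v2 is applied and the version stops at 2; v4 waits until the rebuild triggered by v3 has happened.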
-
-### What changes for sqlite-vec
-
-**1. Version storage: `index_meta['schema_version']` instead of `schema_version.json`.**
-The Lance store needed a sidecar JSON file because Lance had no convenient mutable metadata. The sqlite-vec store already has the `index_meta` key/value table, which is transactional with the data itself (a migration and its version bump commit atomically, which the file never could). Concretely:
-
-- `_create_table(dim)` additionally writes `schema_version = str(CURRENT_SCHEMA_VERSION)` (fresh tables are always current).
-- `stored_schema_version()` reads the meta key, returns 0 on absence/garbage.
-- `drop_table()` already does `DELETE FROM index_meta`, which clears the version with it. No sidecar file, no unlink bookkeeping.
-- `apply_structural_migrations()` writes the new version inside the same transaction as the last applied migration.
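
A minimal sketch of the "0 on absence/garbage" read, using only the standard-library `sqlite3` module and a plain `index_meta` table. The free-function form and the handling of a missing `index_meta` table are assumptions for illustration; the real method lives on the store.

```python
import sqlite3


def stored_schema_version(conn: sqlite3.Connection) -> int:
    """Return the recorded schema version, or 0 when missing or unparsable."""
    try:
        row = conn.execute(
            "SELECT value FROM index_meta WHERE key = 'schema_version'",
        ).fetchone()
    except sqlite3.OperationalError:
        return 0  # assumption: a missing index_meta table also reads as 0
    if row is None:
        return 0
    try:
        return int(row[0])
    except (TypeError, ValueError):
        return 0  # garbage value: treat every migration as pending


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_meta (key TEXT PRIMARY KEY, value TEXT)")
unrecorded = stored_schema_version(conn)

conn.execute("INSERT INTO index_meta VALUES ('schema_version', 'garbage')")
garbage = stored_schema_version(conn)

conn.execute("UPDATE index_meta SET value = '2' WHERE key = 'schema_version'")
recorded = stored_schema_version(conn)
```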
-
-**2. `apply` receives the store, not a table handle.**
-Lance migrations got the raw table for `add_columns`/`alter_columns`. vec0 virtual tables do not support arbitrary `ALTER TABLE`, so structural migrations are SQL against the store's connection. Signature: `apply: Callable[[PaperlessSqliteVecVectorStore], None]`. The store exposes what migrations need: `.client` (connection), `._table_name`, `.vector_dim()`, and the rebuild helper below.
-
-**3. Structural migrations are create+copy+rename, sharing the compact() machinery.**
-The sqlite-vec `compact()` already implements the only structural mutation vec0 supports: build a new table, `INSERT INTO ... SELECT` (vectors copied bit-for-bit, no re-embedding), drop old, rename. Factor it into a shared helper on the store:
-
-```python
-def rebuild_table(
-    self,
-    *,
-    create_sql: str | None = None,
-    copy_select: str | None = None,
-) -> None:
-    """Copy live rows into a freshly created table and swap it in.
-
-    Defaults reproduce the current schema (compaction). Structural
-    migrations pass a modified CREATE statement and a matching SELECT
-    (e.g. adding a column with a literal default). Runs in one
-    transaction; VACUUM afterwards.
-    """
-```
-
-`compact()` becomes a thin caller (threshold check + `rebuild_table()`), and a structural migration like "add a `+page_count` aux column" is:
-
-```python
-Migration(
-    version=2,
-    description="add page_count auxiliary column",
-    requires_reembed=False,
-    apply=lambda store: store.rebuild_table(
-        create_sql=...,        # CREATE VIRTUAL TABLE ... with the new column
-        copy_select="SELECT id, document_id, modified, node_content, embedding, '' FROM {old}",
-    ),
-)
-```
-
-A pleasant consequence: every structural migration is also a compaction (the copy drops dead rows), and the file-format risk surface is one helper with one test suite instead of two code paths.
-
-**4. Bootstrap version for the sqlite-vec store is 1.**
-The transition plan ships the new store without machinery; tables it creates carry no `schema_version` key and therefore read as 0. This release lands with `CURRENT_SCHEMA_VERSION = 1` and `MIGRATIONS = []`, so the bootstrap is unconditionally safe: a 0-version table has no pending migrations and `apply_structural_migrations()` simply stamps it to 1. (The metadata restructure having moved into the transition itself is what makes this clean; the registry's first real entry will be v2, written against tables that are all stamped.)
-
-## Part 2 (worked example, IMPLEMENTED IN THE TRANSITION): the metadata TODO as a re-embed migration
-
-This section was implemented as Task 5 of the transition plan and ships with the store swap, not with this spec. It is kept as the reference example of how to register the next re-embed migration.
-
-### The change
-
-`build_llm_index_text()` currently embeds three short structured values in the body text:
-
-```python
-        f"Filename: {doc.filename}",
-        f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
-        f"Archive Serial Number: {doc.archive_serial_number or ''}",
-```
-
-Per the TODO, move them to `node.metadata` (excluded from embeddings, visible to the LLM via llama-index's metadata prepend), the same treatment title/tags/correspondent/document_type got in PR #12944. Notes and Custom Fields stay in the body (long free text / dynamic count, as the TODO says).
-
-1. `embedding.py build_llm_index_text()`: delete the three lines above (the `lines` list keeps Notes, Custom Fields, and Content). Update the TODO comment to describe only what remains intentional (Notes/Custom Fields stay embedded), or delete it.
-2. `indexing.py build_document_node()` metadata dict gains:
-
-```python
-        "filename": document.filename,
-        "storage_path": document.storage_path.name if document.storage_path else None,
-        "archive_serial_number": document.archive_serial_number,
-```
-
-(`None`/int values are fine here: this dict lives in the node-content JSON, not in vec0 metadata columns; only `document_id`/`modified` are columns with the NULL restriction. Matches the existing convention of `correspondent: None`.)
-
-3. `excluded_embed_metadata_keys=list(metadata.keys())` already covers the new keys; `excluded_llm_metadata_keys` stays `["document_id"]` so the LLM sees the new fields.
-
-### Why this class of change needs a migration
-
-Removing the three lines changes the embedded text of every document, so stored vectors no longer match what the current code would embed. Incremental updates only re-embed documents whose `modified` changed, so without a forced rebuild the index would be a mixed old/new-text population indefinitely. This particular change escaped that fate only because the transition's forced rebuild covers it. The next embedded-text change will not have that luxury and gets registered like this:
-
-```python
-CURRENT_SCHEMA_VERSION: Final[int] = 2
-
-MIGRATIONS: list[Migration] = [
-    Migration(
-        version=2,
-        description="<what changed about the embedded text>",
-        requires_reembed=True,
-        apply=lambda store: None,
-    ),
-]
-```
-
-On the first `update_llm_index` after upgrade, the hook sees the pending re-embed migration, logs, and rebuilds.
-
-### Test plan
-
-Machinery only (the metadata change is tested in the transition plan's Task 5). Port of the #12968 tests, dedicated file `test_vector_store_migrations.py`: structural migration applies in-place without `drop_table`; pending re-embed forces rebuild; version stamping on create/drop; bootstrap stamping of a pre-machinery 0-version table to 1; stop-at-boundary with a mixed [structural v2, reembed v3, structural v4] registry asserting v4 is NOT applied and the stored version stays at 2; `rebuild_table()` round-trips rows byte-for-byte (shared with compact tests).
-
-### Open questions
-
-- PR #12968 disposition: close with a comment pointing at this spec once the machinery lands (the Lance-specific `add_columns` path has no successor; vec0 cannot do in-place column adds).
-- `created`/`added` fields are also candidates for future structural metadata work, but nothing needs them now (YAGNI; noted only so the next reader does not re-derive it).
@@ -1,155 +0,0 @@
-# sqlite-vec Vector Store Design (replaces PaperlessLanceVectorStore)
-
-Date: 2026-06-10
-
-Context: LanceDB wheels SIGILL on non-AVX2 CPUs (#12970); research in `2026-06-10-vector-store-alternatives-research.md` selected sqlite-vec. This is a beta feature, so a one-time re-embed on upgrade is acceptable. Every claim marked [VERIFIED] below was empirically tested against the actual PyPI wheel (0.1.9, and 0.1.10a4 where noted), either in this repo's scratch harness (`/tmp/vstore-avx-test/explore_sqlitevec*.py`) or by the issues-audit agent.
-
-## Version pin: `sqlite-vec==0.1.9`, and why it is load-bearing
-
-- The 0.1.9 linux x86_64 wheel is built with **no SIMD flags at all** (`vec_debug()` shows empty build flags) and passed our qemu Westmere (SSE4.2, no AVX) and SandyBridge (AVX, no AVX2) emulation tests [VERIFIED]. This is the entire point of the migration.
-- The **0.1.10-alpha.4 wheel regresses this**: built with `-mavx -DSQLITE_VEC_ENABLE_AVX` file-wide, no runtime CPU dispatch. It can SIGILL on AVX-less CPUs, including Goldmont Atom/Celeron NAS boxes, exactly the #12970 user base [VERIFIED via vec_debug on the wheel].
-- Guardrails: pin `==0.1.9` exactly; log `SELECT vec_version(), vec_debug()` at store init as an AVX canary; before ever bumping to 0.1.10+, re-check the wheel flags (and consider raising the runtime-dispatch issue upstream first).
-- arm64: 0.1.9 manylinux aarch64 wheel is a proper ELF64 binary, no NEON flags baked [VERIFIED]. (The broken 32-bit "aarch64" wheel era was 0.1.6, fixed since.)
-- No sdist on PyPI (asg017/sqlite-vec#211, open) and no musl wheels; fine for our Debian-based image, blocks Alpine bare-metal installs.
-
-## Schema
-
-One dedicated SQLite database file in `LLM_INDEX_DIR` (e.g. `llmindex.db`), never the Django DB. Connections set `PRAGMA journal_mode=WAL`, `busy_timeout`, `synchronous=NORMAL`.
-
-```sql
-CREATE VIRTUAL TABLE nodes USING vec0(
-    id TEXT PRIMARY KEY,             -- node_id (uuid)
-    document_id TEXT,                -- METADATA column, deliberately NOT a partition key
-    modified TEXT,                   -- ISO timestamp; never NULL (sentinel "")
-    +node_content TEXT,              -- auxiliary column: JSON payload, any size
-    embedding float[{dim}] distance_metric=cosine
-);
-
-CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT);
--- rows: embed_model, dim, schema_version, created_by_vec_version
-```
-
-Design decisions, each verified on 0.1.9:
-
-- **`document_id` is a metadata column, not a partition key.** With a partition key, `k` applies per partition: `k=5 AND document_id IN (3 docs)` returns 15 rows (asg017/sqlite-vec#142, open) [VERIFIED]. As a metadata column the same query returns a correct global top-k of exactly 5 [VERIFIED]. `query_similar_documents()` passes permission-scoped `IN` lists, so per-partition semantics would over-fetch k x N(docs). At our scale the partition-pruning speedup is not needed (filtered KNN at 20K x 1024 was _faster_ than unfiltered: 39 ms vs 74 ms).
-- **One document column, not two.** The Lance store carried both `doc_id` (ref_doc_id) and `document_id`; in our usage they are always the same value (`str(document.id)`), so the new schema keeps only `document_id`.
-- **TEXT primary key works** (insert, UPDATE, DELETE, duplicate rejection) [VERIFIED]. There is no usable rowid mapping with a TEXT pk, which we do not need.
-- **Aux column for the payload.** `+node_content` holds the multi-KB JSON; aux columns cannot appear in KNN WHERE clauses (loud error, not silent) [VERIFIED], which we never do, and are selectable in scans and KNN results [VERIFIED].
-- **Metadata columns reject NULL** (asg017/sqlite-vec#141, open) [VERIFIED]. `_row()` must keep coercing everything through `str(... or "")` as it already does today.
-- **`distance_metric=cosine`**: similarity maps as `1 - distance` (identical vector gives distance 0.0 [VERIFIED]). For unit-norm embeddings the ranking equals today's L2 ranking; for non-normalized models cosine is the safer default, and the beta re-embed makes the behavior change free. (L2 + `1/(1+d)` remains available if exact parity is ever wanted.)
-- **Vectors are always bound as float32 BLOBs** (`struct.pack`/`np.tobytes`), never JSON text: bypasses the locale-dependent `strtod` parsing bug (asg017/sqlite-vec#241, open) entirely.
-- Limits, all comfortable: dims <= 8192, k <= 4096, chunk_size default 1024 [VERIFIED]. TEXT metadata has no length cap; values > 12 bytes go to a shadow text table with a prefix fast-path, and the one historical bug at that boundary (long-metadata DELETE, #274) is fixed in 0.1.9.
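
As a rough illustration of the BLOB binding and the `1 - distance` mapping from the list above, here is a standard-library sketch. The helper names `pack_vector` and `similarity_from_distance` are hypothetical and not part of the store's API.

```python
import struct


def pack_vector(values: list[float]) -> bytes:
    """Serialize floats as a native-endian float32 BLOB (4 bytes per dimension)."""
    return struct.pack(f"{len(values)}f", *values)


def similarity_from_distance(distance: float) -> float:
    """Map a cosine distance to a similarity score using 1 - distance."""
    return 1.0 - distance


# These values are exactly representable in float32, so they round-trip.
blob = pack_vector([0.5, -1.0, 2.0])
roundtrip = list(struct.unpack("3f", blob))
```

Binding `blob` as a query parameter keeps the vector out of the SQL text, so no float-to-string parsing happens on the SQLite side.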
-
-## Method mapping (PaperlessLanceVectorStore -> PaperlessSqliteVecVectorStore)
-
-| Current method                                | sqlite-vec implementation                                                                                                              | Notes                                                                                                                                                                                                                   |
-| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
-| `__init__(uri, table_name, embed_model_name)` | `sqlite3.connect(path)` + `enable_load_extension` + `sqlite_vec.load()` + PRAGMAs                                                      | Same lazy "table may not exist yet" stance                                                                                                                                                                              |
-| `client` property                             | the `sqlite3.Connection`                                                                                                               |                                                                                                                                                                                                                         |
-| `table_exists()`                              | `SELECT 1 FROM sqlite_master WHERE name='nodes'`                                                                                       |                                                                                                                                                                                                                         |
-| `vector_dim()`                                | `index_meta['dim']`                                                                                                                    | Written at table creation; wrong-dim inserts are rejected by vec0 anyway [VERIFIED]                                                                                                                                     |
-| `drop_table()`                                | `DROP TABLE nodes`                                                                                                                     | Drops all 7 shadow tables with it [VERIFIED]; also clear `index_meta`                                                                                                                                                   |
-| `stored_model_name()` / `config_mismatch()`   | `index_meta['embed_model']`                                                                                                            | Same conservative None handling                                                                                                                                                                                         |
-| `_schema(dim, model)`                         | the CREATE statements above                                                                                                            | dim from first batch, as today (`_ensure_table`)                                                                                                                                                                        |
-| `_row(node)`                                  | same dict, vector packed to bytes                                                                                                      | keep `str(... or "")` coercion (NULL rejection)                                                                                                                                                                         |
-| `add(nodes)`                                  | `executemany(INSERT ...)` inside one transaction                                                                                       | ~3,300 rows/s at 1024 dims measured; batching via transactions                                                                                                                                                          |
-| `upsert_document(document_id, nodes)`         | `BEGIN; DELETE FROM nodes WHERE document_id = ?; executemany(INSERT); COMMIT`                                                          | **Not** `INSERT OR REPLACE`: broken on vec0 (asg017/sqlite-vec#259, open). Transaction gives the same no-transient-empty-state guarantee as merge_insert; rollback verified [VERIFIED]                                  |
-| `delete(ref_doc_id)`                          | `DELETE FROM nodes WHERE document_id = ?`                                                                                              |                                                                                                                                                                                                                         |
-| `get_nodes(filters)`                          | `SELECT id, document_id, node_content, embedding FROM nodes [WHERE ...]`                                                               | full scans on vec0 work [VERIFIED]; 45 ms / 20K rows                                                                                                                                                                    |
-| `query(VectorStoreQuery)`                     | `SELECT id, node_content, embedding, distance FROM nodes WHERE embedding MATCH ? AND k = ? [AND filters]` then Python-slice to `top_k` | `k = ?` is mandatory; `LIMIT` cannot be combined with `k` [VERIFIED]; results arrive distance-sorted [VERIFIED]; similarities = `1 - distance`                                                                          |
-| `_build_where(filters)`                       | same EQ/IN translation, but emitting `?` placeholders + params list                                                                    | **Upgrade**: bound parameters replace today's manual `_escape()` string interpolation                                                                                                                                   |
-| `get_modified_times()`                        | `SELECT document_id, modified FROM nodes` + first-seen dedupe in Python                                                                | identical logic                                                                                                                                                                                                         |
-| `ensure_document_id_scalar_index()`           | no-op (delete if nothing else needs it)                                                                                                | metadata filters are evaluated in the chunk scan; nothing to create                                                                                                                                                     |
-| `maybe_create_ann_index()`                    | no-op on 0.1.9                                                                                                                         | ANN (rescore/diskann) is 0.1.10-alpha territory; adopting an ANN index makes the file unreadable by 0.1.9 (one-way door), while flat tables round-trip 0.1.9 <-> 0.1.10a4 cleanly [VERIFIED]. Revisit post-0.1.10-final |
-| `compact(retention_seconds)`                  | **rebuild-based compaction**, see below                                                                                                | replaces Lance MVCC cleanup                                                                                                                                                                                             |
-
-Filter constraint surface (loud errors otherwise, [VERIFIED]): only `=, !=, <, <=, >, >=, IN` on metadata columns in KNN queries. We use only EQ/IN. Never use `NOT IN` (the vtab cannot see it; SQLite post-filters and silently under-delivers below k, asg017/sqlite-vec#116).
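
The placeholder-based `_build_where` translation might look roughly like the sketch below. The input shape (a dict where a string means EQ and a list means IN) is a simplification chosen for illustration, not the llama-index filter types the store actually receives, and rejecting an empty `IN` list is likewise an assumption.

```python
from __future__ import annotations


def build_where(filters: dict[str, str | list[str]]) -> tuple[str, list[str]]:
    """Translate EQ / IN filters into a WHERE fragment plus bound parameters.

    Hypothetical simplified input: a str value means EQ, a list means IN.
    Column names are assumed to come from trusted code, never user input.
    """
    clauses: list[str] = []
    params: list[str] = []
    for column, value in filters.items():
        if isinstance(value, list):
            if not value:
                raise ValueError(f"empty IN list for {column}")
            placeholders = ", ".join("?" for _ in value)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(value)
        else:
            clauses.append(f"{column} = ?")
            params.append(value)
    return " AND ".join(clauses), params


sql, params = build_where({"document_id": ["3", "7", "9"]})
```

Only values travel as bound parameters; the fragment itself contains nothing but column names and `?` markers, which is what makes the manual `_escape()` step unnecessary.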
-
-## Compaction: the one real behavioral difference
-
-vec0 DELETE only flips a validity bit; space is never reclaimed, and VACUUM recovers only about half (asg017/sqlite-vec#54, #220, open; fix PRs #243/#210 unmerged). Measured: 5 delete+reinsert cycles on 2K rows grew the file 3.32 MB -> 6.56 MB; VACUUM got back to 4.94 MB. Paperless's per-document churn (every document edit is a delete+reinsert) hits this directly.
-
-So `compact()` becomes the maintainer-endorsed rebuild (asg017/sqlite-vec#205):
-
-```sql
-CREATE VIRTUAL TABLE nodes_new USING vec0(...);
-INSERT INTO nodes_new SELECT id, document_id, modified, node_content, embedding FROM nodes;
-DROP TABLE nodes;
-ALTER TABLE nodes_new RENAME TO nodes;  -- then VACUUM
-```
-
-This copies vectors without re-embedding, runs under the existing write FileLock, and slots into the existing `document_llmindex compact` command and the scheduled maintenance task. A cheap trigger heuristic: rebuild when `count(*)` on the `nodes_rowids` shadow table (cumulative) exceeds ~2x the live row count, or just keep the existing scheduled cadence.
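
The trigger heuristic above reduces to a single comparison. In this sketch the function name `should_compact` and the way the two counts are obtained are assumptions; only the ~2x threshold comes from the text.

```python
def should_compact(cumulative_rows: int, live_rows: int, ratio: float = 2.0) -> bool:
    """Return True when accumulated rows exceed `ratio` times the live rows."""
    if live_rows <= 0:
        # Nothing live: any leftover rows are pure bloat.
        return cumulative_rows > 0
    return cumulative_rows > ratio * live_rows
```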
-
-## Concurrency
-
-vec0 is a plain vtab over ordinary shadow tables, so standard SQLite WAL semantics apply, and the existing architecture is already the textbook arrangement: writers serialized by `settings.LLM_INDEX_LOCK` FileLock, readers concurrent via WAL. Verified across processes: a reader during another process's open write transaction does not block and sees a consistent pre-transaction snapshot; post-commit it sees the new rows [VERIFIED]. No sqlite-vec-specific multi-process corruption, locking, or segfault reports exist in the tracker. The 0.1.10a4 cached-statement fix (#295) is a Firefox/mozStorage `sqlite3_close()` issue; CPython's `sqlite3` is unaffected, no Python-side reports.
-
-Same caveat as the main SQLite DB: `LLM_INDEX_DIR` should not be on NFS.
-
-## Performance expectations (measured on the 0.1.9 no-SIMD wheel)
-
-- KNN 20K rows x 1024 dims: ~74 ms plain, ~39 ms with a metadata EQ filter.
-- 100K x 768: 185 ms/query (vs 497 ms for LanceDB exact search on identical data).
-- Extrapolated 500K x 1024-1536: ~0.9-1.8 s/query; 384 dims roughly 4x faster. Acceptable for suggestions/chat at the extreme tail; typical installs (low tens of thousands of chunks) are tens of ms.
-- Insert: ~3,300 rows/s at 1024 dims in a single transaction.
-- File size: ~raw vector size (~4.3 KB/row at 1024 dims), no compression; plus the bloat behavior above.
-
-## Migration from the Lance store
-
-Beta policy: re-embed. On startup/first index task: if `LLM_INDEX_DIR` contains a Lance table but no `llmindex.db`, log and queue a full rebuild, then remove the Lance directory. No cross-store vector copy, no lancedb import anywhere in the path (which is what un-breaks #12970 hosts: they currently crash at import, have no usable index, and get a fresh build).
-
-PR #12968's migration machinery maps onto `index_meta['schema_version']`: structural migrations = create-new-table + `INSERT ... SELECT` + rename (vectors copied, no re-embed; same shape as the compaction rebuild); re-embed migrations = drop + full rebuild, jumping straight to the current version.
-
-## Dependency changes
-
- Add: `sqlite-vec==0.1.9` (one ~100 KB platform wheel, zero Python deps).
- Remove: `lancedb~=0.33.0` (and its pylance/lancedb wheels, ~40 MB). `pyarrow` leaves this module; check whether anything else in the AI stack still needs it before dropping it from pyproject.
-
-## Test plan notes
-
- pytest-style per project convention; the store tests can run against a `tmp_path` DB file (or `:memory:` for pure-logic tests; extension loading works on uv-managed CPython [VERIFIED]).
- Port the existing `test_vector_store.py` surface; add dedicated tests for: upsert transactionality (no transient empty state mid-upsert from a second connection), NULL-coercion in `_row()`, k-slice behavior, EQ/IN filter correctness, compaction rebuild preserving rows byte-for-byte, vec_debug canary logging.
- The qemu matrix (`/tmp/vstore-avx-test/`) can be re-run against any future sqlite-vec bump: `qemu-x86_64 -cpu Westmere venv/bin/python candidate_test.py sqlite_vec <dir>`.
-
-## Benchmark harness
-
-`src/bench_vector_store.py` -- standalone head-to-head comparison run during the migration window when both `PaperlessLanceVectorStore` and `PaperlessSqliteVecVectorStore` coexist (Task 3 Phase A of the implementation plan). After Phase B replaces `vector_store.py`, the Lance import fails gracefully and only the sqlite-vec half runs (useful for post-migration baseline checks).
-
-```bash
-cd src
-uv run python bench_vector_store.py            # auto-generates bench_data.pkl on first run
-uv run python bench_vector_store.py --regenerate  # force re-embed
-```
-
-**Phase 1 (data generation, skipped if `bench_data.pkl` exists):** Faker generates `--n-docs` (default 2000) fake documents -- title, body, correspondent, ISO timestamp. Each body is split into `--chunks-per-doc` (default 3) equal-length chunks (~6000 total nodes). A warm-up embed call fires before generation to ensure the model is resident on the GPU. All chunk texts are embedded via Ollama `/api/embed` in batches of 32 and saved to `bench_data.pkl`. The Faker seed is fixed at 42 for reproducibility.
-
-**Phase 2 (benchmark):** Each store runs in an isolated `tempfile.TemporaryDirectory()`. Query vectors are drawn reproducibly from the corpus (every 10th node, wrapping).
-
-| Operation                                 | Reps | Metric                |
-| ----------------------------------------- | ---- | --------------------- |
-| `add()` bulk insert                       | 1    | total time            |
-| `query()` plain                           | 50   | p50 / p95             |
-| `query()` filtered (IN on 20% of doc IDs) | 50   | p50 / p95             |
-| `get_modified_times()`                    | 20   | p50                   |
-| `upsert_document()`                       | 50   | p50 / p95             |
-| `compact()`                               | 1    | total time            |
-| File size                                 | --   | pre- and post-compact |
-
-**CLI flags:** `--n-docs` (2000), `--chunks-per-doc` (3), `--data-file` (`bench_data.pkl`), `--regenerate`, `--ollama-url` (`http://192.168.1.87:11434`), `--embed-model` (`qwen3-embedding:4b`), `--query-iters` (50).
-
-**Dependencies:** `faker` and `httpx` must be available (`uv add --dev faker httpx` if not already installed).
-
-## Risk register (from the 2026-06-10 issues audit)
-
-| Risk                                        | Ref                                     | State          | Disposition                                                                                                                                       |
-| ------------------------------------------- | --------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
-| 0.1.10+ wheels bake AVX, no dispatch        | release CI change, verified on 0.1.10a4 | current        | Pin 0.1.9; vec_debug canary; upstream ask before any bump                                                                                         |
-| DELETE never reclaims space; VACUUM ~50%    | #54, #220                               | open           | Rebuild-based `compact()` above                                                                                                                   |
-| INSERT OR REPLACE broken on vec0            | #259                                    | open           | Use DELETE+INSERT in txn (design already does)                                                                                                    |
-| NULL metadata rejected                      | #141                                    | open           | Sentinel `""` coercion (already current behavior)                                                                                                 |
-| Partition-key IN returns k per partition    | #142                                    | open           | Avoided: document_id is a metadata column                                                                                                         |
-| NOT IN silently under-delivers              | #116                                    | open           | Never emit NOT IN                                                                                                                                 |
-| Locale strtod breaks JSON vector parsing    | #241                                    | open           | Always BLOB-bind vectors                                                                                                                          |
-| Single weekend maintainer; fix PRs languish | #226                                    | open           | Mitigated by Mozilla sponsorship + Firefox vendoring (release-train consumer); pin + vendor-from-source remains the escape hatch (no sdist: #211) |
-| ANN index = one-way file format             | 0.1.10 alphas                           | —              | Do not adopt ANN until 0.1.10 final + flag audit                                                                                                  |
-| Long-TEXT metadata DELETE bug               | #274                                    | fixed in 0.1.9 | Floor requirement `>=0.1.9` already implied by pin                                                                                                |