* Chore(beta): add sqlite-vec 0.1.9 dependency
Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on
pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake
-mavx and would reintroduce the #12970 crash class.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Test(beta): port vector store tests to sqlite-vec backend
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec
Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no
longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0
metadata columns give parameterized EQ/IN filtering, WAL preserves the
lock-free-reader model, and compact() rebuilds the table because vec0
DELETEs never reclaim space.
Implementation notes vs. the Task 3A draft:
- compact() uses a file-swap approach (new db file + Path.replace) rather
than ALTER TABLE RENAME, which does not cascade to shadow tables in
sqlite-vec 0.1.9 (upstream limitation).
- Bloat is tracked via a cumulative total_inserts counter in index_meta
because the _rowids shadow table does not accumulate deleted rows in
0.1.9 (contrary to the design doc assumption from #54).
- None distances from the zero-vector cosine edge case are mapped to
similarity 0.0 rather than raising TypeError.
- Test suite updated accordingly: _bloat_ratio reads index_meta instead
of _rowids; seed collision in force-compact test fixed (seed=100.0).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Enhancement(beta): wire indexing pipeline to the sqlite-vec store
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Enhancement(beta): move filename/storage path/ASN to node metadata
Same treatment as title/tags/correspondent in #12944: excluded from
the embedded text, visible to the LLM via metadata prepend. Changes
embedded text for every document, so it ships inside the sqlite-vec
transition, whose forced rebuild re-embeds everything anyway.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Test(beta): cover legacy LanceDB index cleanup and forced rebuild
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore(beta): drop lancedb dependency
Fixes#12970: the package whose wheels SIGILL on non-AVX2 CPUs is no
longer installed at all.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore(beta): partial pyrefly cleanup on sqlite-vec vector store
- Add MetadataFilter import and isinstance guard in _build_where()
- Add query_embedding None guard in query()
- Fix dict.get() type-checker ambiguity in get_configured_model_name()
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore(beta): drop automatic LanceDB index cleanup on startup
Leave legacy Lance directory removal to the user rather than deleting it
automatically on first run. Beta policy: user is expected to do a clean
re-embed anyway; no need for the system to silently delete their data.
Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called
it, and the associated tests.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore(beta): ruff format pass on sqlite-vec AI files
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Removes the benchmarking file
* Try to resolve or silence some semgrep. But we're using SQL here, not an ORM and we control the inputs, not users
* Enhancement(beta): add schema migration machinery to sqlite-vec vector store
Adds versioned schema migration support modelled after PR #12968's LanceDB
approach, adapted for sqlite-vec's file-swap compaction pattern.
- SCHEMA_VERSION = 1 written to index_meta at table creation and preserved
through compact()
- Migration dataclass with from_version, to_version, kind ("structural" or
"re-embed"), description, and an optional apply(src, dst, dim) callable
- MIGRATIONS registry (empty at v1 baseline); add entries and bump
SCHEMA_VERSION when the schema changes
- check_and_run_migrations(): structural migrations run via the same
file-swap as compact() (no re-embed); re-embed migrations return True
so the caller forces a full rebuild
- update_llm_index() calls check_and_run_migrations() under the write lock
before any indexing work
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore(beta): deduplicate vector store internals via helper methods
Extract three helpers to remove copy-paste between compact() and
_run_structural_migration():
- _meta_set_on(conn, key, value): static upsert into any connection's
index_meta; _meta_set() now delegates to it
- _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the
nosemgrep annotation)
- _swap_in_compact(compact_path, db_path): close/replace/reconnect
sequence used by both file-swap callers
Also normalises compact() error-path cleanup to unlink(missing_ok=True).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Adds equality test and no covers some defensive error handling stuff
* Ensures an embed migration stops the migration chain, just in case
* Silence one kind right but not really semgrep
* Trims dead assignment
* Fix(beta): address Copilot review on sqlite-vec vector store
Three findings from the PR review:
- compact() failure cleanup now unlinks the temporary .compact-wal and
.compact-shm files, matching _run_structural_migration(); previously
only the main .compact file was removed.
- _build_where() fails closed (1 = 0) when filters are requested but none
translate, instead of emitting "()" which is invalid SQL; filters scope
document access, so an empty translation must match no rows.
- Drop the unused table_name constructor parameter (all SQL hardcodes
DEFAULT_TABLE_NAME) and its callers in indexing.py.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers
The compaction/migration file swap replaces the database via os.replace,
but the -wal/-shm files are keyed by path, not inode. A reader holding an
open connection across the swap leaves the old WAL aliased onto the new
file; a subsequent write then corrupts the database (reproduced via
PRAGMA integrity_check).
Add a cross-process read/write lock (filelock.ReadWriteLock) over the
index:
- read_store() holds it shared for the whole connection lifetime (and
closes the connection on exit); concurrent readers do not block.
- compaction and the migration check run under an exclusive lock that
drains readers, and skip with an info log on Timeout (maintenance op,
retries next run).
- Normal writes are untouched: WAL gives reader/writer concurrency and
LLM_INDEX_LOCK still serializes writers, so they never block readers.
load_or_build_index() now takes the store from the caller's read_store()
so the lock and connection span the whole retrieval; chat holds it across
the streamed response. Two new settings: LLM_INDEX_RWLOCK and
LLM_INDEX_COMPACTION_LOCK_TIMEOUT.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
* Ensures the store alays cleans up SQLite connections for any operations, even on errors
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Move LLM index lock outside index dir and skip per-doc tasks on bulk update
Two concurrency bugs from #12893:
[P1] Lock file lived inside LLM_INDEX_DIR. A rebuild calls
shutil.rmtree(LLM_INDEX_DIR), deleting the lock while a worker still
held it. A second worker then acquired a fresh lock on the new path and
ran concurrently, defeating serialisation. Move the lock to
DATA_DIR/locks/llm_index.lock (a new settings constant LLM_INDEX_LOCK)
so rmtree cannot touch it. The locks/ dir is created at settings load
time, matching the existing pattern for LOGGING_DIR.
[P2] document_updated was connected to add_or_update_document_in_llm_index
in apps.py. bulk_update_documents() emits document_updated for every
document in the batch, queuing N per-document LLM tasks, and then also
calls update_llm_index(rebuild=False) once at the end. Pass
skip_ai_index=True when sending document_updated from the bulk path so
the handler skips the per-document enqueue; the existing batch call at
the end of bulk_update_documents is the only LLM update for that path.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: ghost vectors leave KeyError-prone nodes_dict entries after deletion
docstore.delete_document() removes a node from the docstore but leaves its
entry in index_struct.nodes_dict (the FAISS positional-id to node-UUID map).
A subsequent similarity query resolves the ghost position to the deleted UUID,
finds nothing in fetched_nodes_by_id, and raises KeyError inside
_insert_fetched_nodes_into_query_result.
Purge stale nodes_dict entries after each docstore deletion and re-sync the
mutated index_struct into the kvstore so persist() writes the updated mapping.
Dead FAISS vectors remain in the flat index until the next full rebuild
(IndexFlatL2 is append-only); add a try/except KeyError around
retriever.retrieve() as a defensive fallback for any residual ghost positions.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: acquire index lock in query_similar_documents
query_similar_documents() loaded the index and ran the FAISS retriever
without holding the file lock. All write paths (update_llm_index,
llm_index_add_or_update_document, llm_index_remove_document) hold
FileLock(_index_lock_path()), so a concurrent rebuild calling
shutil.rmtree(LLM_INDEX_DIR) while a read is mid-load produces an IOError
or corrupt partial state.
Wrap the load_or_build_index() call and all subsequent retriever work inside
FileLock. The early-return guards (vector_store_file_exists check, empty
allowed_document_ids) remain outside the lock; the DB query for the final
result set also stays outside.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: skip LLM index enqueue on document_updated during version addition
When a document is consumed as a new version of an existing document, the
consumer fires document_consumption_finished (which triggers
add_or_update_document_in_llm_index) and then document_updated for the root
document. Both signals are connected to the same handler, so the root document
was enqueued for LLM indexing twice per version-addition event.
Pass skip_ai_index=True on the consumer's version-addition document_updated
send so the handler's existing guard suppresses the duplicate enqueue.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Test: bulk_update_documents must not enqueue per-doc LLM tasks
With AI enabled, bulk_update_documents() sends document_updated for every
document in the batch. The skip_ai_index=True kwarg (added in the P2 fix)
prevents add_or_update_document_in_llm_index from enqueuing a per-document
task for each one. Only the single update_llm_index call at the end should run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Debug level log sure
* Update src/paperless_ai/indexing.py
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
* Apply suggestion from @shamoon
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
* Fix: Remove all nodes for multi-chunk documents in update_llm_index incremental path
The existing_nodes dict comprehension keyed on document_id silently dropped all
but the last node per document, so only that one node was deleted when a
modified document was re-indexed, leaving all other chunks as ghost vectors in
the FAISS index. Switch to a defaultdict(list) that collects every node per
document_id, then iterate and delete all of them before inserting fresh nodes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Wire document_updated signal to LLM index update handler
Connect document_updated to add_or_update_document_in_llm_index in
DocumentsConfig.ready() so REST API edits (PATCH /api/documents/{id}/)
enqueue an LLM vector store update, matching the existing
document_consumption_finished behavior.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Add file lock around FAISS index mutations to prevent concurrent write corruption
Two concurrent Celery workers calling llm_index_add_or_update_document or
llm_index_remove_document each loaded the same on-disk index independently,
made their own change, and the last writer silently overwrote the first's
update. Wrap both functions and the rebuild/persist body of update_llm_index
in a filelock.FileLock keyed on LLM_INDEX_DIR/index.lock. Add a TOCTOU
comment on queue_llm_index_update_if_needed explaining the residual risk
(duplicate rebuild tasks are wasteful but not corrupting because the lock
serialises the actual write).
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Apply _normalize() in extract_unmatched_names to prevent duplicate suggestions
extract_unmatched_names was using .lower() while _match_names_to_queryset
uses _normalize() (which also strips punctuation). A name like "J. Smith"
matched to existing correspondent "J Smith" would still appear in the
unmatched list, causing duplicate object creation.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Skip LLM index update gracefully when document has no indexable content
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Persist empty index when all documents are deleted to clear stale FAISS vectors
The early-return guard in update_llm_index fired before persist() when no
documents existed, leaving a stale on-disk FAISS index that returned phantom
hits for deleted document IDs. Now the guard only returns early for the
incremental (rebuild=False) path when no index exists on disk; the rebuild
path always continues through to persist(), producing an empty clean index.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Chore: Simplify incremental index update — use docs.values() and deduplicate node extend
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Validate and limit chat question input in ChatStreamingView
Add max_length=4000 to ChatStreamingSerializer.q and replace the bare
request.data["q"] read with proper serializer.is_valid(raise_exception=True)
so oversized or missing questions are rejected with HTTP 400 before
reaching the LLM.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Fix: Add defensive prompt framing to mark document content as untrusted
* Also adds a system prompt which is treated higher that this is untrusted stuff
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>