- chat.py: use sorted() for doc_ids in the MetadataFilters IN clause,
matching the same pattern used in query_similar_documents. Ensures
deterministic filter construction regardless of document iteration order.
- test_chat.py: add test_chat_filter_contains_only_requested_document_ids
verifying that the retriever receives a filter scoped only to the
requested documents (not all indexed documents). Inspired by
test_document_filtered_retriever_applies_lancedb_metadata_filter in
origin/feature/beta-lancedb.
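The ordering point can be shown without llama-index. This is a minimal sketch, assuming a hypothetical `build_in_filter` helper that stands in for the real MetadataFilters construction. The dict layout is illustrative only and is not the library's type.

```python
from collections.abc import Iterable


def build_in_filter(doc_ids: Iterable[int]) -> dict:
    """Hypothetical stand-in for building the MetadataFilters IN clause."""
    # sorted() makes the value list independent of how the caller's
    # collection happens to iterate (set, queryset, generator, ...).
    return {"key": "document_id", "operator": "in", "value": sorted(doc_ids)}


# Same ids, different iteration order, identical filter.
from_set = build_in_filter({3, 1, 2})
from_list = build_in_filter([2, 3, 1])
```

Both calls produce the same filter, which is what lets a test assert on the exact filter the retriever receives.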
Co-Authored-By: shamoon <shamoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All mutating index operations (upsert, delete, rebuild, compact) now use

    with write_store() as store:

instead of explicit FileLock + get_vector_store() at each call site.
Read paths continue to use get_vector_store() directly (no lock needed).
Also type-annotates test fixture params throughout.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Rename vector_store_file_exists -> llm_index_exists (it checks for a table now, not a file)
- Rename _iter_existing_modified -> _stored_modified_times; project away
vector column (cheap scan) and return dict[doc_id, modified_str] directly
- Drop _index_lock_path() indirection; inline settings.LLM_INDEX_LOCK
- Move LLM_INDEX_LOCK alongside the index dir (drop_table is safe; no rmtree)
- Drop current_embedding_dim() redirect; callers use get_embedding_dim()
- Drop lazy-import explanatory comments (constraint lives in CLAUDE.md)
- Batch embedding calls via get_text_embedding_batch() in all three loops
- get_nodes: raise NotImplementedError for node_ids (was silently ignored)
- has_nodes(): cheap limit(1) existence check; chat.py uses it instead of
get_nodes(), which materialized all matching rows
- conftest: use mocker fixture (pytest-mock) instead of bare patch; add
LLM_INDEX_LOCK to temp_llm_index_dir override; type-annotate mock_embed_model
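To illustrate the batching item above, here is a small sketch with a hypothetical CountingEmbedModel. Only the two method names mirror the llama-index embedding interface. The vectors and the call counter are invented for the example.

```python
class CountingEmbedModel:
    """Hypothetical fake that counts how often it is invoked."""

    def __init__(self) -> None:
        self.calls = 0

    def get_text_embedding(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text))]

    def get_text_embedding_batch(self, texts: list[str]) -> list[list[float]]:
        self.calls += 1
        return [[float(len(t))] for t in texts]


def embed_one_by_one(model: CountingEmbedModel, texts: list[str]) -> list[list[float]]:
    # One call per chunk: N round trips to the embedding backend.
    return [model.get_text_embedding(t) for t in texts]


def embed_batched(model: CountingEmbedModel, texts: list[str]) -> list[list[float]]:
    # One call for the whole list; the model decides how to split it.
    return model.get_text_embedding_batch(texts)


chunks = ["alpha", "be", "gam"]
slow, fast = CountingEmbedModel(), CountingEmbedModel()
slow_vectors = embed_one_by_one(slow, chunks)
fast_vectors = embed_batched(fast, chunks)
```

The vectors are identical either way; only the number of calls into the model differs.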
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Remove tests that validated removed internals (get_or_create_storage_context,
remove_document_docstore_nodes, index.docstore.docs) and rewrite the remaining
ones to assert against the LanceDB store directly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Delete _get_document_filtered_retriever (74-line custom FAISS retriever
with expanding top_k loop) and rewrite _stream_chat_with_documents to use
a stock VectorIndexRetriever with MetadataFilters(IN). The no-content
pre-check now calls index.vector_store.get_nodes(filters=...), which
returns [] cleanly for un-indexed documents. Move FakeEmbedding and
mock_embed_model fixture to conftest.py so both test_chat.py and
test_ai_indexing.py share them.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drops migrate_stale_faiss_index (users delete llm_index/ manually on upgrade).
Keeps embedding_dim_mismatch to force a rebuild when the model changes.
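A dimension check of this kind could be as small as the sketch below. The signature is hypothetical (the real function's arguments are not shown here), and treating a missing index as "no mismatch" is an assumption made for the example.

```python
from __future__ import annotations


def embedding_dim_mismatch(stored_dim: int | None, model_dim: int) -> bool:
    """Hypothetical signature: compare the indexed vector width to the model's."""
    if stored_dim is None:
        # Assumption: no index yet, so there is nothing to mismatch against.
        return False
    return stored_dim != model_dim
```

A True result is the signal to rebuild the whole index, since vectors of different widths cannot share a column.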
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds current_embedding_dim() to embedding.py, migrate_stale_faiss_index()
and embedding_dim_mismatch() to indexing.py, and wires both into
update_llm_index so that stale FAISS directories are wiped on startup and
embedding model changes force a full index rebuild.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace get_or_create_storage_context with get_vector_store() (lazy import
of paperless_ai.vector_store inside the function), rewrite load_or_build_index
to use VectorStoreIndex.from_vector_store, and rewrite vector_store_file_exists
to use store.table_exists(). Add LLM_INDEX_TABLE constant and TYPE_CHECKING-only
import of PaperlessLanceVectorStore. Delete remove_document_docstore_nodes and
rewire llm_index_add_or_update_document, llm_index_remove_document, and
update_llm_index to use upsert_document/delete/drop_table on the LanceDB store.
Serialize tags list as JSON string to satisfy flat_metadata validation. Add
test_get_vector_store_roundtrip, test_add_then_remove_document,
test_update_shrinks_chunks_without_orphans, and the subprocess lazy-import guard.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Set id_=str(document.id) on the LlamaDocument constructor in
build_document_node so that every chunk node's ref_doc_id equals the
paperless document pk, enabling the LanceDB adapter's delete(str(doc.id))
and doc_id column to work correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
When upsert_document receives an empty nodes list, delete existing
chunks using the document_id column directly (consistent with the
merge_insert prune predicate) rather than calling delete(), which
filters on doc_id. Guard against a missing table, where there is nothing to delete.
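The intended semantics can be simulated in plain Python. The sketch below uses a dict of row lists as a hypothetical stand-in for LanceDB tables. It makes no LanceDB calls and only mirrors the behaviour described above: replace all chunks of a document, and for an empty node list delete them, tolerating a missing table.

```python
from __future__ import annotations

Row = dict[str, str]


def upsert_document(
    tables: dict[str, list[Row]],
    table_name: str,
    document_id: str,
    nodes: list[Row],
) -> None:
    """Hypothetical in-memory model of upsert-with-prune."""
    rows = tables.get(table_name)
    if not nodes:
        if rows is None:
            return  # missing table: nothing to delete
        # Delete by the document_id column, the same key the prune uses.
        tables[table_name] = [r for r in rows if r["document_id"] != document_id]
        return
    # Drop every old chunk of this document, then add the new ones, so a
    # document that shrinks leaves no orphaned chunks behind.
    kept = [r for r in (rows or []) if r["document_id"] != document_id]
    tables[table_name] = kept + [{**n, "document_id": document_id} for n in nodes]


tables: dict[str, list[Row]] = {}
upsert_document(tables, "chunks", "1", [])  # no table yet: no error
upsert_document(tables, "chunks", "1", [{"id": "a"}, {"id": "b"}])
upsert_document(tables, "chunks", "2", [{"id": "c"}])
upsert_document(tables, "chunks", "1", [{"id": "a"}])  # shrinks to one chunk
after_shrink = [r["id"] for r in tables["chunks"]]
upsert_document(tables, "chunks", "1", [])  # empty list removes the rest
after_empty = [r["id"] for r in tables["chunks"]]
```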
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Fix delete() to use single-quote delimiter consistent with _escape
- Fix _distance comment: L2 not squared-L2
- Fix similarity_top_k zero-guard to use explicit None check
- Replace deprecated table_names() with list_tables().tables (lancedb 0.33)
- Add add() Sequence[BaseNode] signature with collections.abc.Sequence import
- Add test_build_where_or_condition for OR filter branch coverage
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Task-by-task TDD plan implementing the LanceDB design spec: dependency
swap, the PaperlessLanceVectorStore adapter, atomic merge_insert upsert,
ANN threshold + scalar index + compaction, the indexing/chat/similar
rewires, FAISS migration, and a lazy-import guard test so non-AI paths
(management commands) never drag in llama_index/lancedb/pyarrow.
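A lazy-import guard of this kind is usually run in a fresh interpreter, so that modules already imported by the test process do not mask a leak. The sketch below is one possible shape, with a hypothetical `loaded_modules` helper. A real guard would import the project's non-AI entry point and list llama_index, lancedb and pyarrow as candidates.

```python
import subprocess
import sys


def loaded_modules(import_stmt: str, candidates: list[str]) -> list[str]:
    """Run import_stmt in a child interpreter; return the candidates it loaded."""
    code = "\n".join(
        [
            "import sys",
            import_stmt,
            f"names = {candidates!r}",
            "print(','.join(n for n in names if n in sys.modules))",
        ],
    )
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    # Only the last line is ours; the import itself may have printed.
    lines = result.stdout.splitlines()
    last = lines[-1] if lines else ""
    return [name for name in last.split(",") if name]


# Stand-in example using stdlib modules only.
leaked = loaded_modules("import decimal", ["decimal", "no_such_module_xyz"])
```

The guard test then asserts that the returned list is empty for the heavy modules.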
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Spec for swapping the AI feature's llama-index FAISS StorageContext trio
(FaissVectorStore + SimpleDocumentStore + SimpleIndexStore) for LanceDB via
a custom BasePydanticVectorStore adapter (no llama-index-vector-stores-lancedb,
no pandas).
Covers: disk-resident memory-mapped storage, native merge_insert upsert with
when_not_matched_by_source_delete, MetadataFilters(IN) filtering on a top-level
document_id column, auto IVF ANN threshold (IVF_FLAT fallback), MVCC compaction
via optimize(cleanup_older_than=...), migration, concurrency, and testing.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>