mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-07-02 10:14:17 +00:00
a020f64d08
* Chore(beta): add sqlite-vec 0.1.9 dependency

Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake -mavx and would reintroduce the #12970 crash class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): port vector store tests to sqlite-vec backend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec

Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0 metadata columns give parameterized EQ/IN filtering, WAL preserves the lock-free-reader model, and compact() rebuilds the table because vec0 DELETEs never reclaim space.

Implementation notes vs. the Task 3A draft:
- compact() uses a file-swap approach (new db file + Path.replace) rather than ALTER TABLE RENAME, which does not cascade to shadow tables in sqlite-vec 0.1.9 (upstream limitation).
- Bloat is tracked via a cumulative total_inserts counter in index_meta because the _rowids shadow table does not accumulate deleted rows in 0.1.9 (contrary to the design doc assumption from #54).
- None distances from the zero-vector cosine edge case are mapped to similarity 0.0 rather than raising TypeError.
- Test suite updated accordingly: _bloat_ratio reads index_meta instead of _rowids; seed collision in force-compact test fixed (seed=100.0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): wire indexing pipeline to the sqlite-vec store

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): move filename/storage path/ASN to node metadata

Same treatment as title/tags/correspondent in #12944: excluded from the embedded text, visible to the LLM via metadata prepend. Changes embedded text for every document, so it ships inside the sqlite-vec transition, whose forced rebuild re-embeds everything anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): cover legacy LanceDB index cleanup and forced rebuild

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop lancedb dependency

Fixes #12970: the package whose wheels SIGILL on non-AVX2 CPUs is no longer installed at all.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): partial pyrefly cleanup on sqlite-vec vector store

- Add MetadataFilter import and isinstance guard in _build_where()
- Add query_embedding None guard in query()
- Fix dict.get() type-checker ambiguity in get_configured_model_name()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop automatic LanceDB index cleanup on startup

Leave legacy Lance directory removal to the user rather than deleting it automatically on first run. Beta policy: user is expected to do a clean re-embed anyway; no need for the system to silently delete their data. Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called it, and the associated tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): ruff format pass on sqlite-vec AI files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Removes the benchmarking file

* Try to resolve or silence some semgrep. But we're using SQL here, not an ORM and we control the inputs, not users

* Enhancement(beta): add schema migration machinery to sqlite-vec vector store

Adds versioned schema migration support modelled after PR #12968's LanceDB approach, adapted for sqlite-vec's file-swap compaction pattern.

- SCHEMA_VERSION = 1 written to index_meta at table creation and preserved through compact()
- Migration dataclass with from_version, to_version, kind ("structural" or "re-embed"), description, and an optional apply(src, dst, dim) callable
- MIGRATIONS registry (empty at v1 baseline); add entries and bump SCHEMA_VERSION when the schema changes
- check_and_run_migrations(): structural migrations run via the same file-swap as compact() (no re-embed); re-embed migrations return True so the caller forces a full rebuild
- update_llm_index() calls check_and_run_migrations() under the write lock before any indexing work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): deduplicate vector store internals via helper methods

Extract three helpers to remove copy-paste between compact() and _run_structural_migration():
- _meta_set_on(conn, key, value): static upsert into any connection's index_meta; _meta_set() now delegates to it
- _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the nosemgrep annotation)
- _swap_in_compact(compact_path, db_path): close/replace/reconnect sequence used by both file-swap callers

Also normalises compact() error-path cleanup to unlink(missing_ok=True).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Adds equality test and no covers some defensive error handling stuff

* Ensures an embed migration stops the migration chain, just in case

* Silence one kind right but not really semgrep

* Trims dead assignment

* Fix(beta): address Copilot review on sqlite-vec vector store

Three findings from the PR review:
- compact() failure cleanup now unlinks the temporary .compact-wal and .compact-shm files, matching _run_structural_migration(); previously only the main .compact file was removed.
- _build_where() fails closed (1 = 0) when filters are requested but none translate, instead of emitting "()" which is invalid SQL; filters scope document access, so an empty translation must match no rows.
- Drop the unused table_name constructor parameter (all SQL hardcodes DEFAULT_TABLE_NAME) and its callers in indexing.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers

The compaction/migration file swap replaces the database via os.replace, but the -wal/-shm files are keyed by path, not inode. A reader holding an open connection across the swap leaves the old WAL aliased onto the new file; a subsequent write then corrupts the database (reproduced via PRAGMA integrity_check).

Add a cross-process read/write lock (filelock.ReadWriteLock) over the index:
- read_store() holds it shared for the whole connection lifetime (and closes the connection on exit); concurrent readers do not block.
- compaction and the migration check run under an exclusive lock that drains readers, and skip with an info log on Timeout (maintenance op, retries next run).
- Normal writes are untouched: WAL gives reader/writer concurrency and LLM_INDEX_LOCK still serializes writers, so they never block readers.

load_or_build_index() now takes the store from the caller's read_store() so the lock and connection span the whole retrieval; chat holds it across the streamed response. Two new settings: LLM_INDEX_RWLOCK and LLM_INDEX_COMPACTION_LOCK_TIMEOUT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Ensures the store always cleans up SQLite connections for any operations, even on errors

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
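The fail-closed behaviour described above for `_build_where()` can be illustrated with a small standalone sketch. This is not the project's implementation: `build_where`, the `(column, values)` filter tuples, the `KNOWN_COLUMNS` allow-list, and the `nodes` table are all hypothetical names chosen for illustration, and treating "no filters requested" as `1 = 1` is an assumption. It only shows the idea that requested-but-untranslatable filters should yield a clause that matches no rows rather than an empty `()`.

```python
import sqlite3

# Hypothetical allow-list of filterable columns (not taken from the project).
KNOWN_COLUMNS = {"document_id"}


def build_where(filters):
    """Return (sql_fragment, params) for a list of (column, values) filters.

    No filters requested -> match everything ("1 = 1"); this is an assumption.
    Filters requested but none translate -> match nothing ("1 = 0").
    """
    if not filters:
        return "1 = 1", []
    clauses, params = [], []
    for column, values in filters:
        if column not in KNOWN_COLUMNS or not values:
            continue  # this sketch simply skips filters it cannot translate
        placeholders = ", ".join("?" for _ in values)
        clauses.append(f"{column} IN ({placeholders})")
        params.extend(values)
    if not clauses:
        return "1 = 0", []  # fail closed instead of emitting "()"
    return "(" + " AND ".join(clauses) + ")", params


# Tiny in-memory table standing in for the vector store's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (document_id TEXT, body TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?, ?)", [("1", "a"), ("2", "b")])


def select_ids(filters):
    where, params = build_where(filters)
    sql = f"SELECT document_id FROM nodes WHERE {where} ORDER BY document_id"
    return [row[0] for row in conn.execute(sql, params)]
```

With this sketch, `select_ids([("unknown", ["1"])])` returns no rows instead of raising a syntax error or returning every row, which is the property the review finding asks for.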
289 lines
10 KiB
Python
import json
from unittest.mock import MagicMock
from unittest.mock import patch

import pytest
from llama_index.core import settings as llama_settings
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
from llama_index.core.schema import TextNode

from documents.tests.factories import DocumentFactory
from paperless_ai import chat
from paperless_ai import indexing
from paperless_ai.chat import CHAT_ERROR_MESSAGE
from paperless_ai.chat import CHAT_METADATA_DELIMITER
from paperless_ai.chat import stream_chat_with_documents


@pytest.fixture(autouse=True)
def patch_embed_model():
    # Use a real BaseEmbedding subclass to satisfy llama-index 0.14 validation
    llama_settings.Settings.embed_model = MockEmbedding(embed_dim=1536)
    yield
    llama_settings.Settings.embed_model = None


@pytest.fixture(autouse=True)
def patch_embed_nodes():
    with patch(
        "llama_index.core.indices.vector_store.base.embed_nodes",
    ) as mock_embed_nodes:
        mock_embed_nodes.side_effect = lambda nodes, *_args, **_kwargs: {
            node.node_id: [0.1] * 1536 for node in nodes
        }
        yield mock_embed_nodes


@pytest.fixture
def mock_document():
    doc = MagicMock()
    doc.pk = 1
    doc.title = "Test Document"
    doc.filename = "test_file.pdf"
    doc.content = "This is the document content."
    return doc


def assert_chat_output(
    output: list[str],
    *,
    expected_chunks: list[str],
    expected_references: list[dict[str, int | str]],
) -> None:
    assert output[:-1] == expected_chunks

    trailer = output[-1]
    assert trailer.startswith(CHAT_METADATA_DELIMITER)
    assert json.loads(trailer.removeprefix(CHAT_METADATA_DELIMITER)) == {
        "references": expected_references,
    }


@pytest.mark.django_db
def test_stream_chat_with_one_document_retrieval(
    mock_document,
    patch_embed_nodes,
) -> None:
    with (
        patch("paperless_ai.chat.AIClient") as mock_client_cls,
        patch("paperless_ai.chat.load_or_build_index") as mock_load_index,
        patch(
            "llama_index.core.query_engine.RetrieverQueryEngine.from_args",
        ) as mock_query_engine_cls,
    ):
        mock_client = MagicMock()
        mock_client_cls.return_value = mock_client
        mock_client.llm = MagicMock()

        mock_node = TextNode(
            text="This is node content.",
            metadata={"document_id": str(mock_document.pk), "title": "Test Document"},
        )
        mock_index = MagicMock()
        # Simulate get_nodes returning nodes (content exists)
        mock_index.vector_store.get_nodes.return_value = [mock_node]
        mock_load_index.return_value = mock_index

        mock_retriever_instance = MagicMock()
        mock_retriever_instance.retrieve.return_value = [
            MagicMock(
                metadata={
                    "document_id": str(mock_document.pk),
                    "title": "Test Document",
                },
            ),
        ]

        mock_response_stream = MagicMock()
        mock_response_stream.response_gen = iter(["chunk1", "chunk2"])
        mock_query_engine = MagicMock()
        mock_query_engine_cls.return_value = mock_query_engine
        mock_query_engine.query.return_value = mock_response_stream

        with patch(
            "llama_index.core.retrievers.VectorIndexRetriever",
            return_value=mock_retriever_instance,
        ):
            output = list(stream_chat_with_documents("What is this?", [mock_document]))

        mock_query_engine.query.assert_called_once_with("What is this?")
        patch_embed_nodes.assert_not_called()
        assert_chat_output(
            output,
            expected_chunks=["chunk1", "chunk2"],
            expected_references=[
                {"id": mock_document.pk, "title": "Test Document"},
            ],
        )


@pytest.mark.django_db
def test_stream_chat_with_multiple_documents_retrieval(patch_embed_nodes) -> None:
    with (
        patch("paperless_ai.chat.AIClient") as mock_client_cls,
        patch("paperless_ai.chat.load_or_build_index") as mock_load_index,
        patch(
            "llama_index.core.query_engine.RetrieverQueryEngine.from_args",
        ) as mock_query_engine_cls,
    ):
        mock_client = MagicMock()
        mock_client_cls.return_value = mock_client
        mock_client.llm = MagicMock()

        mock_node1 = TextNode(
            text="Content for doc 1.",
            metadata={"document_id": "1", "title": "Document 1"},
        )
        mock_node2 = TextNode(
            text="Content for doc 2.",
            metadata={"document_id": "2", "title": "Document 2"},
        )
        mock_index = MagicMock()
        # Simulate get_nodes returning nodes (content exists)
        mock_index.vector_store.get_nodes.return_value = [mock_node1, mock_node2]
        mock_load_index.return_value = mock_index

        mock_retriever_instance = MagicMock()
        mock_retriever_instance.retrieve.return_value = [
            MagicMock(metadata={"document_id": "1", "title": "Document 1"}),
            MagicMock(metadata={"document_id": "2", "title": "Document 2"}),
        ]

        mock_response_stream = MagicMock()
        mock_response_stream.response_gen = iter(["chunk1", "chunk2"])

        mock_query_engine = MagicMock()
        mock_query_engine_cls.return_value = mock_query_engine
        mock_query_engine.query.return_value = mock_response_stream

        doc1 = MagicMock(pk=1, title="Document 1", filename="doc1.pdf")
        doc2 = MagicMock(pk=2, title="Document 2", filename="doc2.pdf")

        with patch(
            "llama_index.core.retrievers.VectorIndexRetriever",
            return_value=mock_retriever_instance,
        ):
            output = list(stream_chat_with_documents("What's up?", [doc1, doc2]))

        mock_query_engine.query.assert_called_once_with("What's up?")
        patch_embed_nodes.assert_not_called()
        assert_chat_output(
            output,
            expected_chunks=["chunk1", "chunk2"],
            expected_references=[
                {"id": 1, "title": "Document 1"},
                {"id": 2, "title": "Document 2"},
            ],
        )


def test_stream_chat_empty_document_list() -> None:
    with patch("paperless_ai.chat.load_or_build_index") as mock_load_index:
        output = list(stream_chat_with_documents("Any info?", []))
        mock_load_index.assert_not_called()
        assert output == ["Sorry, I couldn't find any content to answer your question."]


def test_stream_chat_no_matching_nodes() -> None:
    with (
        patch("paperless_ai.chat.AIConfig"),
        patch("paperless_ai.chat.AIClient") as mock_client_cls,
        patch("paperless_ai.chat.load_or_build_index") as mock_load_index,
    ):
        mock_client = MagicMock()
        mock_client_cls.return_value = mock_client
        mock_client.llm = MagicMock()

        mock_index = MagicMock()
        # No matching nodes in the store
        mock_index.vector_store.get_nodes.return_value = []
        mock_load_index.return_value = mock_index

        output = list(stream_chat_with_documents("Any info?", [MagicMock(pk=1)]))

        assert output == ["Sorry, I couldn't find any content to answer your question."]


def test_stream_chat_unexpected_failure_returns_generic_error(caplog) -> None:
    with (
        patch("paperless_ai.chat.AIConfig"),
        patch("paperless_ai.chat.AIClient") as mock_client_cls,
        patch("paperless_ai.chat.load_or_build_index") as mock_load_index,
    ):
        mock_client = MagicMock()
        mock_client_cls.return_value = mock_client
        mock_client.llm = MagicMock()

        mock_index = MagicMock()
        # Nodes found so we get past the pre-check
        mock_index.vector_store.get_nodes.return_value = [MagicMock()]
        mock_load_index.return_value = mock_index

        with patch(
            "llama_index.core.retrievers.VectorIndexRetriever",
        ) as mock_retriever_cls:
            mock_retriever = MagicMock()
            mock_retriever.retrieve.side_effect = RuntimeError(
                "private provider detail",
            )
            mock_retriever_cls.return_value = mock_retriever

            output = list(stream_chat_with_documents("Any info?", [MagicMock(pk=1)]))

        assert output == [CHAT_ERROR_MESSAGE]
        assert "Failed to stream document chat response" in caplog.text
        assert "private provider detail" in caplog.text


@pytest.mark.django_db
class TestStreamChatRetrieval:
    def test_no_nodes_yields_no_content_message(
        self,
        temp_llm_index_dir,
        mock_embed_model,
    ) -> None:
        doc = DocumentFactory.create(content="hello world")
        # Nothing indexed for this document yet.
        out = list(chat.stream_chat_with_documents("question?", [doc]))
        assert chat.CHAT_NO_CONTENT_MESSAGE in out

    def test_chat_filter_contains_only_requested_document_ids(
        self,
        temp_llm_index_dir,
        mock_embed_model,
        mocker,
    ) -> None:
        """The MetadataFilter passed to the retriever must be scoped to the
        requested documents only — content from other indexed documents must
        not be surfaced.
        """
        included = DocumentFactory.create(content="included document content")
        excluded = DocumentFactory.create(content="excluded document content")
        indexing.llm_index_add_or_update_document(included)
        indexing.llm_index_add_or_update_document(excluded)

        # VectorIndexRetriever is imported inside _stream_chat_with_documents;
        # patch it at the llama_index source so the lazy import picks it up.
        captured_filters = []
        mock_retriever = mocker.MagicMock()
        mock_retriever.retrieve.return_value = []

        def capture_retriever(*args, **kwargs):
            captured_filters.append(kwargs.get("filters"))
            return mock_retriever

        mocker.patch("paperless_ai.chat.AIClient")
        mocker.patch(
            "llama_index.core.retrievers.VectorIndexRetriever",
            side_effect=capture_retriever,
        )

        list(chat.stream_chat_with_documents("question?", [included]))

        assert captured_filters, "VectorIndexRetriever was never constructed"
        filt = captured_filters[0]
        assert filt is not None, "Retriever must receive a MetadataFilters"
        filter_values = filt.filters[0].value
        assert str(included.pk) in filter_values
        assert str(excluded.pk) not in filter_values
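The stream shape that `assert_chat_output` checks (text chunks followed by one final chunk made of `CHAT_METADATA_DELIMITER` plus a JSON object with a `references` list) could be consumed by a caller roughly as sketched below. The delimiter value and the `split_chat_stream` helper are hypothetical and exist only for this illustration; the real constant is defined in `paperless_ai.chat` and its value is not shown in this file.

```python
import json

# Placeholder value for illustration only; the real constant lives in
# paperless_ai.chat and may look nothing like this.
CHAT_METADATA_DELIMITER = "\n\n__METADATA__"


def split_chat_stream(chunks):
    """Split streamed chunks into (answer_text, references).

    If the last chunk starts with the delimiter, it is treated as the
    metadata trailer; otherwise every chunk is treated as answer text.
    """
    chunks = list(chunks)
    if chunks and chunks[-1].startswith(CHAT_METADATA_DELIMITER):
        payload = json.loads(chunks[-1].removeprefix(CHAT_METADATA_DELIMITER))
        return "".join(chunks[:-1]), payload.get("references", [])
    return "".join(chunks), []


# Example stream mirroring the shape used in the tests above.
stream = [
    "chunk1",
    "chunk2",
    CHAT_METADATA_DELIMITER
    + json.dumps({"references": [{"id": 1, "title": "Test Document"}]}),
]
text, references = split_chat_stream(stream)
```

Streams without a trailer, such as the single no-content or error message, pass through this helper unchanged with an empty reference list.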