mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-07-01 01:34:26 +00:00
a020f64d08
* Chore(beta): add sqlite-vec 0.1.9 dependency

  Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake -mavx and would reintroduce the #12970 crash class.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): port vector store tests to sqlite-vec backend

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec

  Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0 metadata columns give parameterized EQ/IN filtering, WAL preserves the lock-free-reader model, and compact() rebuilds the table because vec0 DELETEs never reclaim space.

  Implementation notes vs. the Task 3A draft:
  - compact() uses a file-swap approach (new db file + Path.replace) rather than ALTER TABLE RENAME, which does not cascade to shadow tables in sqlite-vec 0.1.9 (upstream limitation).
  - Bloat is tracked via a cumulative total_inserts counter in index_meta because the _rowids shadow table does not accumulate deleted rows in 0.1.9 (contrary to the design doc assumption from #54).
  - None distances from the zero-vector cosine edge case are mapped to similarity 0.0 rather than raising TypeError.
  - Test suite updated accordingly: _bloat_ratio reads index_meta instead of _rowids; seed collision in force-compact test fixed (seed=100.0).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): wire indexing pipeline to the sqlite-vec store

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): move filename/storage path/ASN to node metadata

  Same treatment as title/tags/correspondent in #12944: excluded from the embedded text, visible to the LLM via metadata prepend. Changes embedded text for every document, so it ships inside the sqlite-vec transition, whose forced rebuild re-embeds everything anyway.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): cover legacy LanceDB index cleanup and forced rebuild

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop lancedb dependency

  Fixes #12970: the package whose wheels SIGILL on non-AVX2 CPUs is no longer installed at all.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): partial pyrefly cleanup on sqlite-vec vector store

  - Add MetadataFilter import and isinstance guard in _build_where()
  - Add query_embedding None guard in query()
  - Fix dict.get() type-checker ambiguity in get_configured_model_name()

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop automatic LanceDB index cleanup on startup

  Leave legacy Lance directory removal to the user rather than deleting it automatically on first run. Beta policy: user is expected to do a clean re-embed anyway; no need for the system to silently delete their data. Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called it, and the associated tests.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): ruff format pass on sqlite-vec AI files

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Removes the benchmarking file

* Try to resolve or silence some semgrep. But we're using SQL here, not an ORM and we control the inputs, not users

* Enhancement(beta): add schema migration machinery to sqlite-vec vector store

  Adds versioned schema migration support modelled after PR #12968's LanceDB approach, adapted for sqlite-vec's file-swap compaction pattern.

  - SCHEMA_VERSION = 1 written to index_meta at table creation and preserved through compact()
  - Migration dataclass with from_version, to_version, kind ("structural" or "re-embed"), description, and an optional apply(src, dst, dim) callable
  - MIGRATIONS registry (empty at v1 baseline); add entries and bump SCHEMA_VERSION when the schema changes
  - check_and_run_migrations(): structural migrations run via the same file-swap as compact() (no re-embed); re-embed migrations return True so the caller forces a full rebuild
  - update_llm_index() calls check_and_run_migrations() under the write lock before any indexing work

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): deduplicate vector store internals via helper methods

  Extract three helpers to remove copy-paste between compact() and _run_structural_migration():
  - _meta_set_on(conn, key, value): static upsert into any connection's index_meta; _meta_set() now delegates to it
  - _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the nosemgrep annotation)
  - _swap_in_compact(compact_path, db_path): close/replace/reconnect sequence used by both file-swap callers

  Also normalises compact() error-path cleanup to unlink(missing_ok=True).

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Adds equality test and no covers some defensive error handling stuff

* Ensures an embed migration stops the migration chain, just in case

* Silence one kind right but not really semgrep

* Trims dead assignment

* Fix(beta): address Copilot review on sqlite-vec vector store

  Three findings from the PR review:
  - compact() failure cleanup now unlinks the temporary .compact-wal and .compact-shm files, matching _run_structural_migration(); previously only the main .compact file was removed.
  - _build_where() fails closed (1 = 0) when filters are requested but none translate, instead of emitting "()" which is invalid SQL; filters scope document access, so an empty translation must match no rows.
  - Drop the unused table_name constructor parameter (all SQL hardcodes DEFAULT_TABLE_NAME) and its callers in indexing.py.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers

  The compaction/migration file swap replaces the database via os.replace, but the -wal/-shm files are keyed by path, not inode. A reader holding an open connection across the swap leaves the old WAL aliased onto the new file; a subsequent write then corrupts the database (reproduced via PRAGMA integrity_check).

  Add a cross-process read/write lock (filelock.ReadWriteLock) over the index:
  - read_store() holds it shared for the whole connection lifetime (and closes the connection on exit); concurrent readers do not block.
  - compaction and the migration check run under an exclusive lock that drains readers, and skip with an info log on Timeout (maintenance op, retries next run).
  - Normal writes are untouched: WAL gives reader/writer concurrency and LLM_INDEX_LOCK still serializes writers, so they never block readers.

  load_or_build_index() now takes the store from the caller's read_store() so the lock and connection span the whole retrieval; chat holds it across the streamed response. Two new settings: LLM_INDEX_RWLOCK and LLM_INDEX_COMPACTION_LOCK_TIMEOUT.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Ensures the store always cleans up SQLite connections for any operations, even on errors

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
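The fail-closed behaviour described for _build_where() above can be illustrated with a minimal sketch. Only the `1 = 0` fallback and the EQ/IN operators come from the commit message; the function name build_where_sketch, the (column, op, value) filter shape, and the ALLOWED_COLUMNS allow-list are hypothetical and are not taken from the paperless-ngx code.

```python
from collections.abc import Sequence
from typing import Optional

# Hypothetical allow-list of filterable metadata columns (an assumption,
# not the real vec0 schema).
ALLOWED_COLUMNS = {"document_id", "owner_id"}


def build_where_sketch(
    filters: Optional[Sequence[tuple[str, str, object]]],
) -> tuple[str, list[object]]:
    """Translate (column, op, value) filters into a parameterized WHERE body.

    Returns ("", []) when no filters were requested, and ("1 = 0", []) when
    filters were requested but none could be translated, so the query matches
    no rows instead of producing invalid SQL such as "()".
    """
    if not filters:
        return "", []

    clauses: list[str] = []
    params: list[object] = []
    for column, op, value in filters:
        if column not in ALLOWED_COLUMNS:
            continue  # unknown column: cannot be translated
        if op == "==":
            clauses.append(f"{column} = ?")
            params.append(value)
        elif op == "in" and isinstance(value, (list, tuple)) and value:
            placeholders = ", ".join("?" for _ in value)
            clauses.append(f"{column} IN ({placeholders})")
            params.extend(value)

    if not clauses:
        # Filters scope document access, so an empty translation fails closed.
        return "1 = 0", []
    return " AND ".join(clauses), params
```

Values travel only as bound parameters; column names reach the SQL string only after passing the allow-list check.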
265 lines
10 KiB
Python
from unittest.mock import ANY
from unittest.mock import MagicMock
from unittest.mock import patch

import pytest
from django.conf import settings

from documents.models import Document
from paperless.models import LLMEmbeddingBackend
from paperless_ai.embedding import _normalize_llm_index_text
from paperless_ai.embedding import build_llm_index_text
from paperless_ai.embedding import get_configured_model_name
from paperless_ai.embedding import get_embedding_model


@pytest.fixture
def mock_ai_config():
    with patch("paperless_ai.embedding.AIConfig") as MockAIConfig:
        MockAIConfig.return_value.llm_embedding_endpoint = None
        MockAIConfig.return_value.llm_allow_internal_endpoints = True
        MockAIConfig.return_value.llm_context_size = 8192
        yield MockAIConfig


@pytest.fixture
def mock_document():
    doc = MagicMock(spec=Document)
    doc.title = "Test Title"
    doc.filename = "test_file.pdf"
    doc.created = "2023-01-01"
    doc.added = "2023-01-02"
    doc.modified = "2023-01-03"

    tag1 = MagicMock()
    tag1.name = "Tag1"
    tag2 = MagicMock()
    tag2.name = "Tag2"
    doc.tags.all = MagicMock(return_value=[tag1, tag2])

    doc.document_type = MagicMock()
    doc.document_type.name = "Invoice"
    doc.correspondent = MagicMock()
    doc.correspondent.name = "Test Correspondent"
    doc.archive_serial_number = "12345"
    doc.content = "This is the document content."

    cf1 = MagicMock(__str__=lambda x: "Value1")
    cf1.field = MagicMock()
    cf1.field.name = "Field1"
    cf1.value = "Value1"
    cf2 = MagicMock(__str__=lambda x: "Value2")
    cf2.field = MagicMock()
    cf2.field.name = "Field2"
    cf2.value = "Value2"
    doc.custom_fields.all = MagicMock(return_value=[cf1, cf2])

    return doc


def test_get_embedding_model_openai(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OPENAI_LIKE
    mock_ai_config.return_value.llm_embedding_model = "text-embedding-3-small"
    mock_ai_config.return_value.llm_api_key = "test_api_key"
    mock_ai_config.return_value.llm_endpoint = "http://test-url"

    with patch(
        "llama_index.embeddings.openai_like.OpenAILikeEmbedding",
    ) as MockOpenAIEmbedding:
        model = get_embedding_model(mock_ai_config.return_value)
        MockOpenAIEmbedding.assert_called_once_with(
            model_name="text-embedding-3-small",
            api_key="test_api_key",
            api_base="http://test-url",
            http_client=ANY,
            async_http_client=ANY,
        )
        assert model == MockOpenAIEmbedding.return_value


def test_get_embedding_model_openai_prefers_embedding_endpoint(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OPENAI_LIKE
    mock_ai_config.return_value.llm_embedding_model = "text-embedding-3-small"
    mock_ai_config.return_value.llm_api_key = "test_api_key"
    mock_ai_config.return_value.llm_embedding_endpoint = "http://embedding-url"
    mock_ai_config.return_value.llm_endpoint = "http://test-url"

    with patch(
        "llama_index.embeddings.openai_like.OpenAILikeEmbedding",
    ) as MockOpenAIEmbedding:
        model = get_embedding_model(mock_ai_config.return_value)
        MockOpenAIEmbedding.assert_called_once_with(
            model_name="text-embedding-3-small",
            api_key="test_api_key",
            api_base="http://embedding-url",
            http_client=ANY,
            async_http_client=ANY,
        )
        assert model == MockOpenAIEmbedding.return_value


def test_get_embedding_model_openai_blocks_internal_endpoint_when_disallowed(
    mock_ai_config,
):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OPENAI_LIKE
    mock_ai_config.return_value.llm_embedding_model = "text-embedding-3-small"
    mock_ai_config.return_value.llm_api_key = "test_api_key"
    mock_ai_config.return_value.llm_endpoint = "http://127.0.0.1:11434"
    mock_ai_config.return_value.llm_allow_internal_endpoints = False

    with pytest.raises(ValueError, match="non-public address"):
        get_embedding_model(mock_ai_config.return_value)


def test_get_embedding_model_huggingface(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.HUGGINGFACE
    mock_ai_config.return_value.llm_embedding_model = (
        "sentence-transformers/all-MiniLM-L6-v2"
    )

    with patch(
        "llama_index.embeddings.huggingface.HuggingFaceEmbedding",
    ) as MockHuggingFaceEmbedding:
        model = get_embedding_model(mock_ai_config.return_value)
        MockHuggingFaceEmbedding.assert_called_once_with(
            model_name="sentence-transformers/all-MiniLM-L6-v2",
            cache_folder=str(settings.DATA_DIR / "hf_cache"),
        )
        assert model == MockHuggingFaceEmbedding.return_value


def test_get_embedding_model_ollama(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OLLAMA
    mock_ai_config.return_value.llm_embedding_model = "embeddinggemma"
    mock_ai_config.return_value.llm_endpoint = "http://test-url"

    with patch(
        "llama_index.embeddings.ollama.OllamaEmbedding",
    ) as MockOllamaEmbedding:
        model = get_embedding_model(mock_ai_config.return_value)
        MockOllamaEmbedding.assert_called_once_with(
            model_name="embeddinggemma",
            base_url="http://test-url",
            ollama_additional_kwargs={"num_ctx": 8192},
        )
        assert model == MockOllamaEmbedding.return_value


def test_get_embedding_model_ollama_prefers_embedding_endpoint(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OLLAMA
    mock_ai_config.return_value.llm_embedding_model = "embeddinggemma"
    mock_ai_config.return_value.llm_embedding_endpoint = "http://embedding-url"
    mock_ai_config.return_value.llm_endpoint = "http://test-url"

    with patch(
        "llama_index.embeddings.ollama.OllamaEmbedding",
    ) as MockOllamaEmbedding:
        model = get_embedding_model(mock_ai_config.return_value)
        MockOllamaEmbedding.assert_called_once_with(
            model_name="embeddinggemma",
            base_url="http://embedding-url",
            ollama_additional_kwargs={"num_ctx": 8192},
        )
        assert model == MockOllamaEmbedding.return_value


def test_get_embedding_model_ollama_blocks_internal_endpoint_when_disallowed(
    mock_ai_config,
):
    mock_ai_config.return_value.llm_embedding_backend = LLMEmbeddingBackend.OLLAMA
    mock_ai_config.return_value.llm_embedding_model = "embeddinggemma"
    mock_ai_config.return_value.llm_endpoint = "http://127.0.0.1:11434"
    mock_ai_config.return_value.llm_allow_internal_endpoints = False

    with pytest.raises(ValueError, match="non-public address"):
        get_embedding_model(mock_ai_config.return_value)


def test_get_embedding_model_invalid_backend(mock_ai_config):
    mock_ai_config.return_value.llm_embedding_backend = "INVALID_BACKEND"

    with pytest.raises(
        ValueError,
        match="Unsupported embedding backend: INVALID_BACKEND",
    ):
        get_embedding_model(mock_ai_config.return_value)


@pytest.mark.parametrize(
    ("backend", "expected_default"),
    [
        (LLMEmbeddingBackend.OPENAI_LIKE, "text-embedding-3-small"),
        (LLMEmbeddingBackend.HUGGINGFACE, "sentence-transformers/all-MiniLM-L6-v2"),
        (LLMEmbeddingBackend.OLLAMA, "embeddinggemma"),
    ],
)
def test_get_configured_model_name_falls_back_to_backend_default(
    mock_ai_config,
    backend,
    expected_default,
):
    """When no model is explicitly configured, each backend has a distinct default."""
    config = mock_ai_config.return_value
    config.llm_embedding_backend = backend
    config.llm_embedding_model = None
    assert get_configured_model_name(config) == expected_default


def test_get_configured_model_name_explicit_overrides_default(mock_ai_config):
    """An explicit model name overrides the backend default for all backends."""
    config = mock_ai_config.return_value
    config.llm_embedding_backend = LLMEmbeddingBackend.OPENAI_LIKE
    config.llm_embedding_model = "my-custom-model"
    # The backend default for OPENAI_LIKE is "text-embedding-3-small", so if
    # the explicit name was ignored we'd get the wrong result.
    assert get_configured_model_name(config) == "my-custom-model"


def test_build_llm_index_text(mock_document):
    with patch("documents.models.Note.objects.filter") as mock_notes_filter:
        mock_notes_filter.return_value = [
            MagicMock(note="Note1"),
            MagicMock(note="Note2"),
        ]

        result = build_llm_index_text(mock_document)

        # Structured fields live in node.metadata for LLM context -- not body text
        assert "Title: Test Title" not in result
        assert "Created: 2023-01-01" not in result
        assert "Tags: Tag1, Tag2" not in result
        assert "Document Type: Invoice" not in result
        assert "Correspondent: Test Correspondent" not in result
        assert "Filename:" not in result
        assert "Storage Path:" not in result
        assert "Archive Serial Number:" not in result

        # Fields without a metadata equivalent stay in body text
        assert "Notes: Note1,Note2" in result
        assert "Content:\n\nThis is the document content." in result
        assert "Custom Field - Field1: Value1\nCustom Field - Field2: Value2" in result


def test_build_llm_index_text_normalizes_ocr_punctuation_runs(mock_document):
    mock_document.content = (
        "Introduction ................................................ 7\n"
        "Hardware Limitation ________________________________________ 9\n"
        "Keep short punctuation like INV-100 and ellipses..."
    )

    with patch("documents.models.Note.objects.filter", return_value=[]):
        result = build_llm_index_text(mock_document)

    assert "Introduction 7" in result
    assert "Hardware Limitation 9" in result
    assert "INV-100" in result
    assert "ellipses..." in result


def test_normalize_llm_index_text_collapses_ocr_leaders_without_joining_lines():
    assert _normalize_llm_index_text("A........B\nC____D----E") == "A B\nC D E"


def test_normalize_llm_index_text_collapses_non_breaking_spaces():
    assert _normalize_llm_index_text("A\u00a0........\u00a0B") == "A B"
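
The normalization tests above pin down behaviour without showing how it might be achieved. The following is a minimal sketch that satisfies the same input/output pairs. It is not the actual _normalize_llm_index_text implementation; the name normalize_sketch and the threshold of four repeated characters are assumptions chosen so that "..." and "INV-100" survive.

```python
import re

# Assumption: a run of 4 or more dots, underscores, or hyphens is an OCR
# "leader" (as in a table of contents); shorter runs are real punctuation.
_LEADER_RUN = re.compile(r"[._\-]{4,}")
# Horizontal whitespace only: any whitespace character except a newline.
_HSPACE_RUN = re.compile(r"[^\S\n]+")


def normalize_sketch(text: str) -> str:
    """Collapse OCR leader runs and horizontal whitespace, keeping line breaks."""
    text = text.replace("\u00a0", " ")  # non-breaking space -> plain space
    text = _LEADER_RUN.sub(" ", text)   # leader run -> single space
    return _HSPACE_RUN.sub(" ", text)   # squeeze spaces/tabs, never newlines
```

Excluding the newline from the whitespace class is what keeps separate lines from being joined.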