mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-06-30 09:14:17 +00:00
a020f64d08
* Chore(beta): add sqlite-vec 0.1.9 dependency

Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake -mavx and would reintroduce the #12970 crash class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): port vector store tests to sqlite-vec backend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec

Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0 metadata columns give parameterized EQ/IN filtering, WAL preserves the lock-free-reader model, and compact() rebuilds the table because vec0 DELETEs never reclaim space.

Implementation notes vs. the Task 3A draft:
- compact() uses a file-swap approach (new db file + Path.replace) rather than ALTER TABLE RENAME, which does not cascade to shadow tables in sqlite-vec 0.1.9 (upstream limitation).
- Bloat is tracked via a cumulative total_inserts counter in index_meta because the _rowids shadow table does not accumulate deleted rows in 0.1.9 (contrary to the design doc assumption from #54).
- None distances from the zero-vector cosine edge case are mapped to similarity 0.0 rather than raising TypeError.
- Test suite updated accordingly: _bloat_ratio reads index_meta instead of _rowids; seed collision in force-compact test fixed (seed=100.0).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): wire indexing pipeline to the sqlite-vec store

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): move filename/storage path/ASN to node metadata

Same treatment as title/tags/correspondent in #12944: excluded from the embedded text, visible to the LLM via metadata prepend. Changes embedded text for every document, so it ships inside the sqlite-vec transition, whose forced rebuild re-embeds everything anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): cover legacy LanceDB index cleanup and forced rebuild

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop lancedb dependency

Fixes #12970: the package whose wheels SIGILL on non-AVX2 CPUs is no longer installed at all.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): partial pyrefly cleanup on sqlite-vec vector store

- Add MetadataFilter import and isinstance guard in _build_where()
- Add query_embedding None guard in query()
- Fix dict.get() type-checker ambiguity in get_configured_model_name()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop automatic LanceDB index cleanup on startup

Leave legacy Lance directory removal to the user rather than deleting it automatically on first run. Beta policy: user is expected to do a clean re-embed anyway; no need for the system to silently delete their data. Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called it, and the associated tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): ruff format pass on sqlite-vec AI files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Removes the benchmarking file

* Try to resolve or silence some semgrep. But we're using SQL here, not an ORM and we control the inputs, not users

* Enhancement(beta): add schema migration machinery to sqlite-vec vector store

Adds versioned schema migration support modelled after PR #12968's LanceDB approach, adapted for sqlite-vec's file-swap compaction pattern.

- SCHEMA_VERSION = 1 written to index_meta at table creation and preserved through compact()
- Migration dataclass with from_version, to_version, kind ("structural" or "re-embed"), description, and an optional apply(src, dst, dim) callable
- MIGRATIONS registry (empty at v1 baseline); add entries and bump SCHEMA_VERSION when the schema changes
- check_and_run_migrations(): structural migrations run via the same file-swap as compact() (no re-embed); re-embed migrations return True so the caller forces a full rebuild
- update_llm_index() calls check_and_run_migrations() under the write lock before any indexing work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): deduplicate vector store internals via helper methods

Extract three helpers to remove copy-paste between compact() and _run_structural_migration():
- _meta_set_on(conn, key, value): static upsert into any connection's index_meta; _meta_set() now delegates to it
- _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the nosemgrep annotation)
- _swap_in_compact(compact_path, db_path): close/replace/reconnect sequence used by both file-swap callers

Also normalises compact() error-path cleanup to unlink(missing_ok=True).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Adds equality test and no covers some defensive error handling stuff

* Ensures an embed migration stops the migration chain, just in case

* Silence one kind right but not really semgrep

* Trims dead assignment

* Fix(beta): address Copilot review on sqlite-vec vector store

Three findings from the PR review:
- compact() failure cleanup now unlinks the temporary .compact-wal and .compact-shm files, matching _run_structural_migration(); previously only the main .compact file was removed.
- _build_where() fails closed (1 = 0) when filters are requested but none translate, instead of emitting "()" which is invalid SQL; filters scope document access, so an empty translation must match no rows.
- Drop the unused table_name constructor parameter (all SQL hardcodes DEFAULT_TABLE_NAME) and its callers in indexing.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers

The compaction/migration file swap replaces the database via os.replace, but the -wal/-shm files are keyed by path, not inode. A reader holding an open connection across the swap leaves the old WAL aliased onto the new file; a subsequent write then corrupts the database (reproduced via PRAGMA integrity_check).

Add a cross-process read/write lock (filelock.ReadWriteLock) over the index:
- read_store() holds it shared for the whole connection lifetime (and closes the connection on exit); concurrent readers do not block.
- compaction and the migration check run under an exclusive lock that drains readers, and skip with an info log on Timeout (maintenance op, retries next run).
- Normal writes are untouched: WAL gives reader/writer concurrency and LLM_INDEX_LOCK still serializes writers, so they never block readers.

load_or_build_index() now takes the store from the caller's read_store() so the lock and connection span the whole retrieval; chat holds it across the streamed response.

Two new settings: LLM_INDEX_RWLOCK and LLM_INDEX_COMPACTION_LOCK_TIMEOUT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Ensures the store always cleans up SQLite connections for any operations, even on errors

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
134 lines
5.0 KiB
Python
import re
from typing import TYPE_CHECKING

from django.conf import settings

if TYPE_CHECKING:
    from llama_index.core.base.embeddings.base import BaseEmbedding

from documents.models import Document
from documents.models import Note
from paperless.config import AIConfig
from paperless.models import LLMEmbeddingBackend
from paperless.network import PinnedHostAsyncHTTPTransport
from paperless.network import PinnedHostHTTPTransport
from paperless.network import create_pinned_async_httpx_client
from paperless.network import create_pinned_httpx_client
from paperless.network import validate_outbound_http_url

OCR_LEADER_REGEX = re.compile(r"[._\-\u00b7]{4,}")
HORIZONTAL_WHITESPACE_REGEX = re.compile(r"[ \t\u00a0]+")


def get_embedding_model(config: AIConfig) -> "BaseEmbedding":
    match config.llm_embedding_backend:
        case LLMEmbeddingBackend.OPENAI_LIKE:
            from llama_index.embeddings.openai_like import OpenAILikeEmbedding

            endpoint = config.llm_embedding_endpoint or config.llm_endpoint or None
            http_client = None
            async_http_client = None
            if endpoint:
                http_client = create_pinned_httpx_client(
                    endpoint,
                    allow_internal=config.llm_allow_internal_endpoints,
                )
                async_http_client = create_pinned_async_httpx_client(
                    endpoint,
                    allow_internal=config.llm_allow_internal_endpoints,
                )
            return OpenAILikeEmbedding(
                model_name=config.llm_embedding_model or "text-embedding-3-small",
                api_key=config.llm_api_key,
                api_base=endpoint,
                http_client=http_client,
                async_http_client=async_http_client,
            )
        case LLMEmbeddingBackend.HUGGINGFACE:
            from llama_index.embeddings.huggingface import HuggingFaceEmbedding

            return HuggingFaceEmbedding(
                model_name=config.llm_embedding_model
                or "sentence-transformers/all-MiniLM-L6-v2",
                cache_folder=str(settings.DATA_DIR / "hf_cache"),
            )
        case LLMEmbeddingBackend.OLLAMA:
            from llama_index.embeddings.ollama import OllamaEmbedding
            from ollama import AsyncClient
            from ollama import Client

            endpoint = (
                config.llm_embedding_endpoint
                or config.llm_endpoint
                or "http://localhost:11434"
            )
            validate_outbound_http_url(
                endpoint,
                allow_internal=config.llm_allow_internal_endpoints,
            )
            embedding = OllamaEmbedding(
                model_name=config.llm_embedding_model or "embeddinggemma",
                base_url=endpoint,
                ollama_additional_kwargs={"num_ctx": config.llm_context_size},
            )
            embedding._client = Client(
                host=endpoint,
                transport=PinnedHostHTTPTransport(
                    allow_internal=config.llm_allow_internal_endpoints,
                ),
            )
            embedding._async_client = AsyncClient(
                host=endpoint,
                transport=PinnedHostAsyncHTTPTransport(
                    allow_internal=config.llm_allow_internal_endpoints,
                ),
            )
            return embedding
        case _:
            raise ValueError(
                f"Unsupported embedding backend: {config.llm_embedding_backend}",
            )


_DEFAULT_MODEL_NAMES = {
    LLMEmbeddingBackend.OPENAI_LIKE: "text-embedding-3-small",
    LLMEmbeddingBackend.HUGGINGFACE: "sentence-transformers/all-MiniLM-L6-v2",
    LLMEmbeddingBackend.OLLAMA: "embeddinggemma",
}


def get_configured_model_name(config: AIConfig) -> str:
    """Return the canonical name of the currently configured embedding model."""
    # dict.get(key, default) overload resolution fails for TextChoices keys in some
    # type checkers; use `or` fallback to avoid the ambiguity.
    default = (
        _DEFAULT_MODEL_NAMES.get(
            config.llm_embedding_backend,
        )
        or "sentence-transformers/all-MiniLM-L6-v2"
    )
    return config.llm_embedding_model or default


def _normalize_llm_index_text(text: str) -> str:
    text = OCR_LEADER_REGEX.sub(" ", text)
    return HORIZONTAL_WHITESPACE_REGEX.sub(" ", text)


def build_llm_index_text(doc: Document) -> str:
    # Short structured fields (filename, storage path, ASN, title, tags, ...) live
    # in node.metadata: excluded from embeddings, shown to the LLM via metadata
    # prepend. Notes and Custom Fields stay in the body: Notes can be long free
    # text, Custom Fields are dynamic in count and best kept in the embedding.
    lines = [
        f"Notes: {','.join([str(c.note) for c in Note.objects.filter(document=doc)])}",
    ]

    for instance in doc.custom_fields.all():
        lines.append(f"Custom Field - {instance.field.name}: {instance}")

    lines.append("\nContent:\n")
    lines.append(doc.content or "")

    return _normalize_llm_index_text("\n".join(lines))
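
The effect of the two normalization regexes in _normalize_llm_index_text() is easiest to see on a small input. The sketch below copies the two patterns from the file above into a standalone script with no Django dependency; the normalize() wrapper name and the sample string are illustrative assumptions, not part of the project.

```python
import re

# Patterns copied verbatim from the module above.
OCR_LEADER_REGEX = re.compile(r"[._\-\u00b7]{4,}")
HORIZONTAL_WHITESPACE_REGEX = re.compile(r"[ \t\u00a0]+")


def normalize(text: str) -> str:
    """Standalone copy of the two-step normalization (hypothetical name)."""
    # Step 1: runs of four or more leader characters (dots, underscores,
    # hyphens, middle dots) become a single space.
    text = OCR_LEADER_REGEX.sub(" ", text)
    # Step 2: runs of spaces, tabs and non-breaking spaces collapse to one
    # space. Newlines are not in the character class, so line structure stays.
    return HORIZONTAL_WHITESPACE_REGEX.sub(" ", text)


# Hypothetical OCR-like sample: a dotted leader, tabs, a non-breaking space,
# and a version string whose single dots must survive.
sample = "Total ........ 42\nName:\t\tJane\u00a0Doe\nv1.2.3"
result = normalize(sample)
print(repr(result))  # 'Total 42\nName: Jane Doe\nv1.2.3'
```

Because the leader pattern requires at least four consecutive characters, ordinary punctuation such as "v1.2.3" or a three-hyphen rule is left untouched.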