Compare commits

14 Commits

Author SHA1 Message Date
stumpylog 1f4a871b8f Refactor(beta): extract visible_document_ids_for_user helper
The owner-aware "resolve user to visible document pks" block was duplicated
verbatim between get_context_for_document and get_taxonomy_hints_for_document.
Extract it into indexing.visible_document_ids_for_user, next to its sibling
normalize_document_ids, and call it from both paths.

No behavior change: the helper returns None when user is None (unfiltered
retrieval) and the same pk list otherwise.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 29f9475818 Test(beta): use documents factories for taxonomy hint test fixtures
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog d06f66b618 Test(beta): use pytest-django fixtures and drop needless DB markers in taxonomy hint tests
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog f3f55e3866 Enhancement(beta): feed taxonomy hints into AI document suggestions
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 24b81c15f6 Enhancement(beta): splice taxonomy hints into the AI classifier prompt
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 5202b0880e Enhancement(beta): let name matching short-circuit on taxonomy hints
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 7ed58f9664 Enhancement(beta): gate and assemble taxonomy hints for a document
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 43eb3295ce Enhancement(beta): format taxonomy hints into prompt blocks
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog e0ba4cfada Enhancement(beta): add taxonomy hint builder from RAG node metadata
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
stumpylog 73062bd5ab Refactor(beta): extract retrieve_similar_nodes from query_similar_documents
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-15 15:07:31 -07:00
Trenton H a020f64d08 Enhancement(beta): replace LanceDB vector store with sqlite-vec (#12990)
* Chore(beta): add sqlite-vec 0.1.9 dependency

Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on
pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake
-mavx and would reintroduce the #12970 crash class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): port vector store tests to sqlite-vec backend

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec

Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no
longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0
metadata columns give parameterized EQ/IN filtering, WAL preserves the
lock-free-reader model, and compact() rebuilds the table because vec0
DELETEs never reclaim space.

Implementation notes vs. the Task 3A draft:
- compact() uses a file-swap approach (new db file + Path.replace) rather
  than ALTER TABLE RENAME, which does not cascade to shadow tables in
  sqlite-vec 0.1.9 (upstream limitation).
- Bloat is tracked via a cumulative total_inserts counter in index_meta
  because the _rowids shadow table does not accumulate deleted rows in
  0.1.9 (contrary to the design doc assumption from #54).
- None distances from the zero-vector cosine edge case are mapped to
  similarity 0.0 rather than raising TypeError.
- Test suite updated accordingly: _bloat_ratio reads index_meta instead
  of _rowids; seed collision in force-compact test fixed (seed=100.0).
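
The file-swap approach from the implementation notes can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the project's compact(): the `items` table, its columns, and the name `compact_by_file_swap` are hypothetical, and a plain table stands in for the vec0 virtual table so the example runs without sqlite-vec.

```python
import sqlite3
from pathlib import Path


def compact_by_file_swap(db_path: Path) -> None:
    """Copy live rows into a fresh sibling file, then swap it into place.

    Hypothetical sketch: assumes a single table named ``items`` with
    columns (id, payload) and no other open connections to the database.
    """
    compact_path = db_path.with_name(db_path.name + ".compact")
    compact_path.unlink(missing_ok=True)

    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(compact_path)
    try:
        # Recreate the schema in the new file and copy only surviving rows,
        # so space held by deleted rows is not carried over.
        dst.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT)")
        rows = src.execute("SELECT id, payload FROM items")
        dst.executemany("INSERT INTO items (id, payload) VALUES (?, ?)", rows)
        dst.commit()
    except Exception:
        # Leave the original database untouched and drop the partial copy.
        dst.close()
        compact_path.unlink(missing_ok=True)
        raise
    finally:
        src.close()
    dst.close()

    # Path.replace overwrites the destination in a single rename call.
    compact_path.replace(db_path)
```

The sketch ignores the -wal/-shm sidecar files and concurrent readers; a later commit in this PR adds a read/write lock around the swap for that reason.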

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): wire indexing pipeline to the sqlite-vec store

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Enhancement(beta): move filename/storage path/ASN to node metadata

Same treatment as title/tags/correspondent in #12944: excluded from
the embedded text, visible to the LLM via metadata prepend. Changes
embedded text for every document, so it ships inside the sqlite-vec
transition, whose forced rebuild re-embeds everything anyway.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test(beta): cover legacy LanceDB index cleanup and forced rebuild

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop lancedb dependency

Fixes #12970: the package whose wheels SIGILL on non-AVX2 CPUs is no
longer installed at all.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): partial pyrefly cleanup on sqlite-vec vector store

- Add MetadataFilter import and isinstance guard in _build_where()
- Add query_embedding None guard in query()
- Fix dict.get() type-checker ambiguity in get_configured_model_name()

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): drop automatic LanceDB index cleanup on startup

Leave legacy Lance directory removal to the user rather than deleting it
automatically on first run. Beta policy: user is expected to do a clean
re-embed anyway; no need for the system to silently delete their data.

Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called
it, and the associated tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): ruff format pass on sqlite-vec AI files

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Removes the benchmarking file

* Try to resolve or silence some semgrep findings. We use raw SQL here rather than an ORM, and the inputs are controlled by us, not by users

* Enhancement(beta): add schema migration machinery to sqlite-vec vector store

Adds versioned schema migration support modelled after PR #12968's LanceDB
approach, adapted for sqlite-vec's file-swap compaction pattern.

- SCHEMA_VERSION = 1 written to index_meta at table creation and preserved
  through compact()
- Migration dataclass with from_version, to_version, kind ("structural" or
  "re-embed"), description, and an optional apply(src, dst, dim) callable
- MIGRATIONS registry (empty at v1 baseline); add entries and bump
  SCHEMA_VERSION when the schema changes
- check_and_run_migrations(): structural migrations run via the same
  file-swap as compact() (no re-embed); re-embed migrations return True
  so the caller forces a full rebuild
- update_llm_index() calls check_and_run_migrations() under the write lock
  before any indexing work

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore(beta): deduplicate vector store internals via helper methods

Extract three helpers to remove copy-paste between compact() and
_run_structural_migration():
- _meta_set_on(conn, key, value): static upsert into any connection's
  index_meta; _meta_set() now delegates to it
- _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the
  nosemgrep annotation)
- _swap_in_compact(compact_path, db_path): close/replace/reconnect
  sequence used by both file-swap callers

Also normalises compact() error-path cleanup to unlink(missing_ok=True).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Adds an equality test and marks some defensive error handling as no-cover

* Ensures an embed migration stops the migration chain, just in case

* Silence one semgrep finding that is kind of right, but not really

* Trims dead assignment

* Fix(beta): address Copilot review on sqlite-vec vector store

Three findings from the PR review:

- compact() failure cleanup now unlinks the temporary .compact-wal and
  .compact-shm files, matching _run_structural_migration(); previously
  only the main .compact file was removed.
- _build_where() fails closed (1 = 0) when filters are requested but none
  translate, instead of emitting "()" which is invalid SQL; filters scope
  document access, so an empty translation must match no rows.
- Drop the unused table_name constructor parameter (all SQL hardcodes
  DEFAULT_TABLE_NAME) and its callers in indexing.py.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers

The compaction/migration file swap replaces the database via os.replace,
but the -wal/-shm files are keyed by path, not inode. A reader holding an
open connection across the swap leaves the old WAL aliased onto the new
file; a subsequent write then corrupts the database (reproduced via
PRAGMA integrity_check).

Add a cross-process read/write lock (filelock.ReadWriteLock) over the
index:

- read_store() holds it shared for the whole connection lifetime (and
  closes the connection on exit); concurrent readers do not block.
- compaction and the migration check run under an exclusive lock that
  drains readers, and skip with an info log on Timeout (maintenance op,
  retries next run).
- Normal writes are untouched: WAL gives reader/writer concurrency and
  LLM_INDEX_LOCK still serializes writers, so they never block readers.

load_or_build_index() now takes the store from the caller's read_store()
so the lock and connection span the whole retrieval; chat holds it across
the streamed response. Two new settings: LLM_INDEX_RWLOCK and
LLM_INDEX_COMPACTION_LOCK_TIMEOUT.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* Ensures the store always cleans up SQLite connections for any operations, even on errors

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-15 13:20:41 -07:00
Yuki MIZUNO 11fb09e4f4 Fix (beta): don't send chat message on Enter while composing with IME (CJK) (#12999)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-06-13 13:48:19 +00:00
Trenton H 8ed4bf2011 Fix: Apply unicode normalization to all paths and path components (#12993) 2026-06-13 12:45:54 +00:00
Trenton H 92c016ce47 Fix: Handle the UTF 16 and BOM text files better (#12994) 2026-06-13 05:35:38 -07:00
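
For context on the Unicode normalization fix (#12993) in the list above: the same visible character can be stored as different byte sequences, which is why path components are normalized to NFC. A minimal sketch using only the standard library follows; the helper name `to_nfc` is hypothetical and is not a function from this change set.

```python
import unicodedata

# "ü" as one precomposed code point (NFC) versus "u" plus a combining
# diaeresis (NFD). Both render identically but encode differently.
nfc_u = unicodedata.normalize("NFC", "\u00fc")
nfd_u = unicodedata.normalize("NFD", "\u00fc")

nfc_hex = nfc_u.encode("utf-8").hex(" ")
nfd_hex = nfd_u.encode("utf-8").hex(" ")


def to_nfc(component: str) -> str:
    """Hypothetical helper: normalize one path component to NFC."""
    return unicodedata.normalize("NFC", component)


print(nfc_hex)  # c3 bc
print(nfd_hex)  # 75 cc 88
print(to_nfc(nfd_u) == nfc_u)  # True
```

On filesystems that compare names byte for byte, the two forms would name two different files, so normalizing before building a path avoids lookups that miss an existing file.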
35 changed files with 2844 additions and 1089 deletions
+1 -2
@@ -49,7 +49,6 @@ dependencies = [
   "ijson>=3.2",
   "imap-tools~=1.13.0",
   "jinja2~=3.1.5",
-  "lancedb~=0.33.0",
   "langdetect~=1.0.9",
   "llama-index-core>=0.14.21",
   "llama-index-embeddings-huggingface>=0.6.1",
@@ -62,7 +61,6 @@ dependencies = [
   "openai>=2.32",
   "pathvalidate~=3.3.1",
   "pdf2image~=1.17.0",
-  "pyarrow>=16",
   "python-dateutil~=2.9.0",
   "python-dotenv~=1.2.1",
   "python-gnupg~=0.5.4",
@@ -74,6 +72,7 @@ dependencies = [
   "scikit-learn~=1.8.0",
   "sentence-transformers>=5.4.1",
   "setproctitle~=1.3.4",
+  "sqlite-vec==0.1.9",
   "tantivy~=0.26.0",
   "tika-client~=0.11.0",
   "torch~=2.11.0",
@@ -188,4 +188,14 @@ describe('ChatComponent', () => {
     component.searchInputKeyDown(event)
     expect(component.sendMessage).toHaveBeenCalled()
   })
+  it('should not send message on Enter key press while composing with IME', () => {
+    jest.spyOn(component, 'sendMessage')
+    const event = new KeyboardEvent('keydown', {
+      key: 'Enter',
+      isComposing: true,
+    })
+    component.searchInputKeyDown(event)
+    expect(component.sendMessage).not.toHaveBeenCalled()
+  })
 })
@@ -155,7 +155,10 @@ export class ChatComponent implements OnInit {
   }
   public searchInputKeyDown(event: KeyboardEvent) {
-    if (event.key === 'Enter') {
+    if (
+      event.key === 'Enter' &&
+      !(event.isComposing || event.keyCode === 229)
+    ) {
       event.preventDefault()
       this.sendMessage()
     }
@@ -1,9 +1,8 @@
import hashlib
import io
import json
import os
import shutil
import zipfile
import tempfile
from itertools import islice
from pathlib import Path
from typing import TYPE_CHECKING
@@ -99,8 +98,6 @@ class StreamingManifestWriter:
*,
compare_json: bool = False,
files_in_export_dir: "set[Path] | None" = None,
zip_file: "zipfile.ZipFile | None" = None,
zip_arcname: str | None = None,
) -> None:
self._path = path.resolve()
self._tmp_path = self._path.with_suffix(self._path.suffix + ".tmp")
@@ -108,20 +105,12 @@ class StreamingManifestWriter:
self._files_in_export_dir: set[Path] = (
files_in_export_dir if files_in_export_dir is not None else set()
)
self._zip_file = zip_file
self._zip_arcname = zip_arcname
self._zip_mode = zip_file is not None
self._file = None
self._first = True
def open(self) -> None:
if self._zip_mode:
# zipfile only allows one open write handle at a time, so buffer
# the manifest in memory and write it atomically on close()
self._file = io.StringIO()
else:
self._path.parent.mkdir(parents=True, exist_ok=True)
self._file = self._tmp_path.open("w", encoding="utf-8")
self._path.parent.mkdir(parents=True, exist_ok=True)
self._file = self._tmp_path.open("w", encoding="utf-8")
self._file.write("[")
self._first = True
@@ -142,18 +131,15 @@ class StreamingManifestWriter:
if self._file is None:
return
self._file.write("\n]")
if self._zip_mode:
self._zip_file.writestr(self._zip_arcname, self._file.getvalue())
self._file.close()
self._file = None
if not self._zip_mode:
self._finalize()
self._finalize()
def discard(self) -> None:
if self._file is not None:
self._file.close()
self._file = None
if not self._zip_mode and self._tmp_path.exists():
if self._tmp_path.exists():
self._tmp_path.unlink()
def _finalize(self) -> None:
@@ -330,13 +316,18 @@ class Command(CryptMixin, PaperlessCommand):
self.files_in_export_dir: set[Path] = set()
self.exported_files: set[str] = set()
self.zip_file: zipfile.ZipFile | None = None
self._zip_dirs: set[str] = set()
# If zipping, save the original target for later and
# get a temporary directory for the target instead
temp_dir = None
self.original_target = self.target
if self.zip_export:
zip_name = options["zip_name"]
self.zip_path = (self.target / zip_name).with_suffix(".zip")
self.zip_tmp_path = self.zip_path.parent / (self.zip_path.name + ".tmp")
settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
temp_dir = tempfile.TemporaryDirectory(
dir=settings.SCRATCH_DIR,
prefix="paperless-export",
)
self.target = Path(temp_dir.name).resolve()
if not self.target.exists():
raise CommandError("That path doesn't exist")
@@ -347,53 +338,30 @@ class Command(CryptMixin, PaperlessCommand):
if not os.access(self.target, os.W_OK):
raise CommandError("That path doesn't appear to be writable")
if self.zip_export:
if self.compare_checksums:
self.stdout.write(
self.style.WARNING(
"--compare-checksums is ignored when --zip is used",
),
)
if self.compare_json:
self.stdout.write(
self.style.WARNING(
"--compare-json is ignored when --zip is used",
),
)
try:
# Prevent any ongoing changes in the documents
with FileLock(settings.MEDIA_LOCK):
if self.zip_export:
self.zip_file = zipfile.ZipFile(
self.zip_tmp_path,
"w",
compression=zipfile.ZIP_DEFLATED,
allowZip64=True,
)
self.dump()
if self.zip_file is not None:
self.zip_file.close()
self.zip_file = None
self.zip_tmp_path.rename(self.zip_path)
# We've written everything to the temporary directory in this case,
# now make an archive in the original target, with all files stored
if self.zip_export and temp_dir is not None:
shutil.make_archive(
self.original_target / options["zip_name"],
format="zip",
root_dir=temp_dir.name,
)
finally:
# Ensure zip_file is closed and the incomplete .tmp is removed on failure
if self.zip_file is not None:
self.zip_file.close()
self.zip_file = None
if self.zip_export and self.zip_tmp_path.exists():
self.zip_tmp_path.unlink()
# Always cleanup the temporary directory, if one was created
if self.zip_export and temp_dir is not None:
temp_dir.cleanup()
def dump(self) -> None:
# 1. Take a snapshot of what files exist in the current export folder
# (skipped in zip mode — always write fresh, no skip/compare logic applies)
if not self.zip_export:
for x in self.target.glob("**/*"):
if x.is_file():
self.files_in_export_dir.add(x.resolve())
for x in self.target.glob("**/*"):
if x.is_file():
self.files_in_export_dir.add(x.resolve())
# 2. Create manifest, containing all correspondents, types, tags, storage paths
# note, documents and ui_settings
@@ -465,8 +433,6 @@ class Command(CryptMixin, PaperlessCommand):
manifest_path,
compare_json=self.compare_json,
files_in_export_dir=self.files_in_export_dir,
zip_file=self.zip_file,
zip_arcname="manifest.json",
) as writer:
with transaction.atomic():
for key, qs in manifest_key_to_object_query.items():
@@ -585,12 +551,8 @@ class Command(CryptMixin, PaperlessCommand):
self.target,
)
else:
# 5. Remove pre-existing files/dirs from target, keeping the
# in-progress zip (.tmp) and any prior zip at the final path
skip = {self.zip_path.resolve(), self.zip_tmp_path.resolve()}
for item in self.target.glob("*"):
if item.resolve() in skip:
continue
# 5. Remove anything in the original location (before moving the zip)
for item in self.original_target.glob("*"):
if item.is_dir():
shutil.rmtree(item)
else:
@@ -760,23 +722,9 @@ class Command(CryptMixin, PaperlessCommand):
if self.use_folder_prefix:
manifest_name = Path("json") / manifest_name
manifest_name = (self.target / manifest_name).resolve()
if not self.zip_export:
manifest_name.parent.mkdir(parents=True, exist_ok=True)
manifest_name.parent.mkdir(parents=True, exist_ok=True)
self.check_and_write_json(content, manifest_name)
def _ensure_zip_dirs(self, arcname: str) -> None:
"""Write directory marker entries for all parent directories of arcname.
Some zip viewers only show folder structure when explicit directory
entries exist, so we add them to avoid confusing users.
"""
parts = Path(arcname).parts[:-1]
for i in range(len(parts)):
dir_arc = "/".join(parts[: i + 1]) + "/"
if dir_arc not in self._zip_dirs:
self._zip_dirs.add(dir_arc)
self.zip_file.mkdir(dir_arc)
def check_and_write_json(
self,
content: list[dict] | dict,
@@ -789,38 +737,32 @@ class Command(CryptMixin, PaperlessCommand):
This preserves the file timestamps when no changes are made.
"""
if self.zip_export:
arcname = str(target.resolve().relative_to(self.target))
self._ensure_zip_dirs(arcname)
self.zip_file.writestr(
arcname,
target = target.resolve()
perform_write = True
if target in self.files_in_export_dir:
self.files_in_export_dir.remove(target)
if self.compare_json:
target_checksum = hashlib.blake2b(target.read_bytes()).hexdigest()
src_str = json.dumps(
content,
cls=DjangoJSONEncoder,
indent=2,
ensure_ascii=False,
)
src_checksum = hashlib.blake2b(src_str.encode("utf-8")).hexdigest()
if src_checksum == target_checksum:
perform_write = False
if perform_write:
target.write_text(
json.dumps(
content,
cls=DjangoJSONEncoder,
indent=2,
ensure_ascii=False,
),
encoding="utf-8",
)
return
target = target.resolve()
json_str = json.dumps(
content,
cls=DjangoJSONEncoder,
indent=2,
ensure_ascii=False,
)
perform_write = True
if target in self.files_in_export_dir:
self.files_in_export_dir.remove(target)
if self.compare_json:
target_checksum = hashlib.blake2b(target.read_bytes()).hexdigest()
src_checksum = hashlib.blake2b(json_str.encode("utf-8")).hexdigest()
if src_checksum == target_checksum:
perform_write = False
if perform_write:
target.write_text(json_str, encoding="utf-8")
def check_and_copy(
self,
@@ -833,12 +775,6 @@ class Command(CryptMixin, PaperlessCommand):
the source attributes
"""
if self.zip_export:
arcname = str(target.resolve().relative_to(self.target))
self._ensure_zip_dirs(arcname)
self.zip_file.write(source, arcname=arcname)
return
target = target.resolve()
if target in self.files_in_export_dir:
self.files_in_export_dir.remove(target)
+24 -15
@@ -1,6 +1,7 @@
import logging
import os
import re
import unicodedata
from collections.abc import Iterable
from pathlib import PurePath
@@ -36,10 +37,12 @@ class FilePathTemplate(Template):
def clean_filepath(value: str) -> str:
"""
Clean up a filepath by:
1. Removing newlines and carriage returns
2. Removing extra spaces before and after forward slashes
3. Preserving spaces in other parts of the path
1. Normalizing Unicode to NFC form to prevent byte-level mismatches
2. Removing newlines and carriage returns
3. Removing extra spaces before and after forward slashes
4. Preserving spaces in other parts of the path
"""
value = unicodedata.normalize("NFC", value)
value = value.replace("\n", "").replace("\r", "")
value = re.sub(r"\s*/\s*", "/", value)
@@ -181,17 +184,17 @@ def get_basic_metadata_context(
"""
return {
"title": pathvalidate.sanitize_filename(
document.title,
unicodedata.normalize("NFC", document.title),
replacement_text="-",
),
"correspondent": pathvalidate.sanitize_filename(
document.correspondent.name,
unicodedata.normalize("NFC", document.correspondent.name),
replacement_text="-",
)
if document.correspondent
else no_value_default,
"document_type": pathvalidate.sanitize_filename(
document.document_type.name,
unicodedata.normalize("NFC", document.document_type.name),
replacement_text="-",
)
if document.document_type
@@ -202,7 +205,10 @@ def get_basic_metadata_context(
"owner_username": document.owner.username
if document.owner
else no_value_default,
"original_name": PurePath(document.original_filename).with_suffix("").name
"original_name": unicodedata.normalize(
"NFC",
PurePath(document.original_filename).with_suffix("").name,
)
if document.original_filename
else no_value_default,
"doc_pk": f"{document.pk:07}",
@@ -269,12 +275,12 @@ def get_tags_context(tags: Iterable[Tag]) -> dict[str, str | list[str]]:
return {
"tag_list": pathvalidate.sanitize_filename(
",".join(
sorted(tag.name for tag in tags),
sorted(unicodedata.normalize("NFC", tag.name) for tag in tags),
),
replacement_text="-",
),
# Assumed to be ordered, but a template could loop through to find what they want
"tag_name_list": [x.name for x in tags],
"tag_name_list": [unicodedata.normalize("NFC", x.name) for x in tags],
}
@@ -301,7 +307,7 @@ def get_custom_fields_context(
CustomField.FieldDataType.LONG_TEXT,
}:
value = pathvalidate.sanitize_filename(
field_instance.value,
unicodedata.normalize("NFC", field_instance.value),
replacement_text="-",
)
elif (
@@ -310,10 +316,13 @@ def get_custom_fields_context(
):
options = field_instance.field.extra_data["select_options"]
value = pathvalidate.sanitize_filename(
next(
option["label"]
for option in options
if option["id"] == field_instance.value
unicodedata.normalize(
"NFC",
next(
option["label"]
for option in options
if option["id"] == field_instance.value
),
),
replacement_text="-",
)
@@ -321,7 +330,7 @@ def get_custom_fields_context(
value = field_instance.value
field_data["custom_fields"][
pathvalidate.sanitize_filename(
field_instance.field.name,
unicodedata.normalize("NFC", field_instance.field.name),
replacement_text="-",
)
] = {
@@ -0,0 +1,95 @@
import unicodedata
from typing import TYPE_CHECKING
from unittest import mock
import celery.result
import pytest
from django.core.files.uploadedfile import SimpleUploadedFile
if TYPE_CHECKING:
from documents.data_models import ConsumableDocument
from documents.data_models import DocumentMetadataOverrides
@pytest.fixture()
def consume_file_mock():
with mock.patch("documents.tasks.consume_file.apply_async") as m:
m.return_value = celery.result.AsyncResult(id="test-task-id")
yield m
@pytest.fixture()
def directories(tmp_path, settings, _media_settings):
scratch = tmp_path / "scratch"
scratch.mkdir()
settings.SCRATCH_DIR = scratch
return scratch
@pytest.mark.django_db
class TestPostDocumentNFCNormalization:
def test_nfd_filename_normalized_to_nfc(
self,
admin_client,
consume_file_mock: mock.MagicMock,
directories,
):
"""Uploaded file with NFD filename must have its name stored as NFC."""
nfd = unicodedata.normalize("NFD", "Rechnung März.pdf")
nfc = unicodedata.normalize("NFC", "Rechnung März.pdf")
# Verify our test strings actually differ at the byte level
assert nfd != nfc
uploaded = SimpleUploadedFile(
nfd,
b"%PDF-1.4 test",
content_type="application/pdf",
)
response = admin_client.post(
"/api/documents/post_document/",
{"document": uploaded},
)
assert response.status_code == 200
task_kwargs = consume_file_mock.call_args.kwargs["kwargs"]
input_doc: ConsumableDocument = task_kwargs["input_doc"]
overrides: DocumentMetadataOverrides = task_kwargs["overrides"]
# The temp file on disk must have an NFC name
assert input_doc.original_file.name == nfc, (
f"Expected NFC filename {nfc!r}, got {input_doc.original_file.name!r}"
)
# The override filename stored for later use must also be NFC
assert overrides.filename == nfc, (
f"Expected NFC override filename {nfc!r}, got {overrides.filename!r}"
)
assert unicodedata.is_normalized("NFC", overrides.filename)
def test_already_nfc_filename_unchanged(
self,
admin_client,
consume_file_mock: mock.MagicMock,
directories,
):
"""Uploaded file with already-NFC filename must pass through unchanged."""
nfc = unicodedata.normalize("NFC", "Invoice_2024.pdf")
uploaded = SimpleUploadedFile(
nfc,
b"%PDF-1.4 test",
content_type="application/pdf",
)
response = admin_client.post(
"/api/documents/post_document/",
{"document": uploaded},
)
assert response.status_code == 200
task_kwargs = consume_file_mock.call_args.kwargs["kwargs"]
overrides: DocumentMetadataOverrides = task_kwargs["overrides"]
assert overrides.filename == nfc
assert unicodedata.is_normalized("NFC", overrides.filename)
+187
@@ -0,0 +1,187 @@
"""
Tests for NFC Unicode normalization in generate_filename / FilePathTemplate.render().
NFC `ü` (UTF-8: c3 bc) and NFD `ü` (UTF-8: 75 cc 88) are visually identical but
produce different byte sequences. On Linux (ext4, ZFS) these are distinct filenames.
All paths produced by the templating system must be NFC-normalized.
"""
import unicodedata
import pytest
from documents.file_handling import generate_filename
from documents.models import CustomField
from documents.models import CustomFieldInstance
from documents.tests.factories import CorrespondentFactory
from documents.tests.factories import DocumentFactory
from documents.tests.factories import StoragePathFactory
from documents.tests.factories import TagFactory
@pytest.mark.django_db
class TestGenerateFilenameNFCNormalization:
@pytest.mark.parametrize(
"raw,display",
[
(unicodedata.normalize("NFD", "Gemüse"), "Gemüse"),
(unicodedata.normalize("NFD", "Café"), "Café"),
(unicodedata.normalize("NFD", "naïve"), "naïve"),
],
)
def test_nfd_title_normalized_to_nfc(self, settings, raw, display):
"""NFD title must produce NFC path bytes."""
settings.FILENAME_FORMAT = "{{ title }}"
nfc = unicodedata.normalize("NFC", display)
assert raw != nfc # confirm byte-level difference
doc = DocumentFactory(title=raw, mime_type="application/pdf")
result = generate_filename(doc)
assert str(result) == f"{nfc}.pdf"
assert str(result).encode() == f"{nfc}.pdf".encode()
def test_nfd_correspondent_normalized_to_nfc(self, settings):
"""NFD correspondent name must produce NFC path component."""
settings.FILENAME_FORMAT = "{{ correspondent }}/{{ title }}"
nfd = unicodedata.normalize("NFD", "Müller")
nfc = unicodedata.normalize("NFC", "Müller")
correspondent = CorrespondentFactory(name=nfd)
doc = DocumentFactory(
title="invoice",
correspondent=correspondent,
mime_type="application/pdf",
)
result = generate_filename(doc)
assert str(result) == f"{nfc}/invoice.pdf"
assert str(result).encode() == f"{nfc}/invoice.pdf".encode()
def test_nfd_storage_path_normalized_to_nfc(self, settings):
"""NFD literal in StoragePath.path template must produce NFC path bytes."""
settings.FILENAME_FORMAT = None
nfd = unicodedata.normalize("NFD", "Büro")
nfc = unicodedata.normalize("NFC", "Büro")
# StoragePath.path is used directly as the format/template string.
# Literal NFD characters in the template must survive rendering as NFC.
sp = StoragePathFactory(path=f"{nfd}/{{{{ title }}}}")
doc = DocumentFactory(title="doc", storage_path=sp, mime_type="application/pdf")
result = generate_filename(doc)
assert str(result).encode() == f"{nfc}/doc.pdf".encode()
def test_nfd_raw_document_title_normalized_to_nfc(self, settings):
"""NFD title accessed via document.title (unsanitized context) must also be NFC."""
settings.FILENAME_FORMAT = "{{ document.title }}"
nfd = unicodedata.normalize("NFD", "Café")
nfc = unicodedata.normalize("NFC", "Café")
doc = DocumentFactory(title=nfd, mime_type="application/pdf")
result = generate_filename(doc)
assert str(result) == f"{nfc}.pdf"
assert str(result).encode() == f"{nfc}.pdf".encode()
@pytest.mark.django_db
class TestContextBuilderNFCNormalization:
"""
Defense-in-depth: context builder functions must NFC-normalize string inputs
before passing them to sanitize_filename(). Task 1 already normalizes the
final rendered path via clean_filepath(), so these tests may already pass;
they exist as regression guards for the context-builder layer.
"""
def test_nfd_tag_name_normalized_in_tag_list(self, settings):
"""NFD tag name must appear as NFC bytes in the {{ tag_list }} shorthand."""
settings.FILENAME_FORMAT = "{{ tag_list }}/{{ title }}"
nfd = unicodedata.normalize("NFD", "Büro")
nfc = unicodedata.normalize("NFC", "Büro")
assert nfd != nfc # confirm they differ at byte level
tag = TagFactory(name=nfd)
doc = DocumentFactory(title="doc", mime_type="application/pdf")
doc.tags.set([tag])
result = generate_filename(doc)
assert str(result).encode() == f"{nfc}/doc.pdf".encode()
def test_nfd_original_name_normalized_to_nfc(self, settings):
settings.FILENAME_FORMAT = "{{ original_name }}"
nfd = unicodedata.normalize("NFD", "Rechnung März")
nfc = unicodedata.normalize("NFC", "Rechnung März")
doc = DocumentFactory(
original_filename=f"{nfd}.pdf",
mime_type="application/pdf",
)
result = generate_filename(doc)
assert str(result).encode() == f"{nfc}.pdf".encode()
def test_nfd_custom_field_string_value_normalized(self, settings):
"""NFD value in a STRING-type custom field must appear as NFC in the context."""
settings.FILENAME_FORMAT = (
"{{ custom_fields['Location']['value'] }}/{{ title }}"
)
nfd_value = unicodedata.normalize("NFD", "Düsseldorf")
nfc_value = unicodedata.normalize("NFC", "Düsseldorf")
assert nfd_value != nfc_value
doc = DocumentFactory(title="report", mime_type="application/pdf")
cf = CustomField.objects.create(
name="Location",
data_type=CustomField.FieldDataType.STRING,
)
CustomFieldInstance.objects.create(
document=doc,
field=cf,
value_text=nfd_value,
)
result = generate_filename(doc)
assert str(result).encode() == f"{nfc_value}/report.pdf".encode()
def test_nfd_custom_field_name_normalized_as_key(self, settings):
"""NFD characters in a custom field name must appear as NFC in the context dict key."""
nfd_name = unicodedata.normalize("NFD", "Größe")
nfc_name = unicodedata.normalize("NFC", "Größe")
assert nfd_name != nfc_name
settings.FILENAME_FORMAT = f"{{% if custom_fields['{nfc_name}'] %}}{{{{ custom_fields['{nfc_name}']['value'] }}}}/{{{{ title }}}}{{% else %}}{{{{ title }}}}{{% endif %}}"
doc = DocumentFactory(title="letter", mime_type="application/pdf")
cf = CustomField.objects.create(
name=nfd_name,
data_type=CustomField.FieldDataType.STRING,
)
CustomFieldInstance.objects.create(
document=doc,
field=cf,
value_text="Berlin",
)
result = generate_filename(doc)
# If field name key is NFC-normalized, the template condition succeeds
# and result is "Berlin/letter.pdf"; otherwise it falls back to "letter.pdf"
assert str(result) == "Berlin/letter.pdf"
def test_nfd_tag_name_list_normalized_to_nfc(self, settings):
"""NFD tag names in tag_name_list must appear as NFC bytes when iterated."""
settings.FILENAME_FORMAT = (
"{% for t in tag_name_list %}{{ t }}{% endfor %}/{{ title }}"
)
nfd = unicodedata.normalize("NFD", "Büro")
nfc = unicodedata.normalize("NFC", "Büro")
assert nfd != nfc # confirm byte-level difference
doc = DocumentFactory(title="doc", mime_type="application/pdf")
doc.tags.add(TagFactory(name=nfd))
result = generate_filename(doc)
assert str(result).encode() == f"{nfc}/doc.pdf".encode()
@@ -615,7 +615,7 @@ class TestExportImport(
self.assertIsFile(expected_file)
with ZipFile(expected_file) as zip:
# 11 files + 3 directory marker entries for the subdirectory structure
# Extras are from the directories, which also appear in the listing
self.assertEqual(len(zip.namelist()), 14)
self.assertIn("manifest.json", zip.namelist())
self.assertIn("metadata.json", zip.namelist())
@@ -666,57 +666,6 @@ class TestExportImport(
self.assertIn("manifest.json", zip.namelist())
self.assertIn("metadata.json", zip.namelist())
def test_export_zip_atomic_on_failure(self) -> None:
"""
GIVEN:
- Request to export documents to zipfile
WHEN:
- Export raises an exception mid-way
THEN:
- No .zip file is written at the final path
- The .tmp file is cleaned up
"""
args = ["document_exporter", self.target, "--zip"]
with mock.patch.object(
document_exporter.Command,
"dump",
side_effect=RuntimeError("simulated failure"),
):
with self.assertRaises(RuntimeError):
call_command(*args)
expected_zip = self.target / f"export-{timezone.localdate().isoformat()}.zip"
expected_tmp = (
self.target / f"export-{timezone.localdate().isoformat()}.zip.tmp"
)
self.assertIsNotFile(expected_zip)
self.assertIsNotFile(expected_tmp)
def test_export_zip_no_scratch_dir(self) -> None:
"""
GIVEN:
- Request to export documents to zipfile
WHEN:
- Documents are exported
THEN:
- No files are written under SCRATCH_DIR during the export
(the old workaround used a temp dir there)
"""
shutil.rmtree(Path(self.dirs.media_dir) / "documents")
shutil.copytree(
Path(__file__).parent / "samples" / "documents",
Path(self.dirs.media_dir) / "documents",
)
scratch_before = set(settings.SCRATCH_DIR.glob("paperless-export*"))
args = ["document_exporter", self.target, "--zip"]
call_command(*args)
scratch_after = set(settings.SCRATCH_DIR.glob("paperless-export*"))
self.assertEqual(scratch_before, scratch_after)
def test_export_target_not_exists(self) -> None:
"""
GIVEN:
+3
@@ -368,6 +368,7 @@ class TestAISuggestions(DirectoriesMixin, TestCase):
self.document,
self.user,
None,
hints=None,
)
@patch("documents.views.get_ai_document_classification")
@@ -399,6 +400,7 @@ class TestAISuggestions(DirectoriesMixin, TestCase):
self.document,
self.user,
"de-de",
hints=None,
)
self.assertEqual(
get_llm_suggestion_cache(
@@ -438,6 +440,7 @@ class TestAISuggestions(DirectoriesMixin, TestCase):
self.document,
self.user,
"fr-fr",
hints=None,
)
self.assertEqual(
get_llm_suggestion_cache(
+9
@@ -245,6 +245,7 @@ from paperless_ai.matching import match_correspondents_by_name
from paperless_ai.matching import match_document_types_by_name
from paperless_ai.matching import match_storage_paths_by_name
from paperless_ai.matching import match_tags_by_name
from paperless_ai.taxonomy import get_taxonomy_hints_for_document
from paperless_mail.models import MailAccount
from paperless_mail.models import MailRule
from paperless_mail.oauth import PaperlessMailOAuth2Manager
@@ -1494,11 +1495,14 @@ class DocumentViewSet(
refresh_suggestions_cache(doc.pk)
return Response(cached_llm_suggestions.suggestions)
hints = get_taxonomy_hints_for_document(doc, request.user)
try:
llm_suggestions = get_ai_document_classification(
doc,
request.user,
output_language,
hints=hints,
)
except ValueError as exc:
logger.exception(
@@ -1513,18 +1517,22 @@ class DocumentViewSet(
matched_tags = match_tags_by_name(
llm_suggestions.get("tags", []),
request.user,
hinted_names=set(hints["tags"]) if hints else None,
)
matched_correspondents = match_correspondents_by_name(
llm_suggestions.get("correspondents", []),
request.user,
hinted_names=set(hints["correspondents"]) if hints else None,
)
matched_types = match_document_types_by_name(
llm_suggestions.get("document_types", []),
request.user,
hinted_names=set(hints["document_types"]) if hints else None,
)
matched_paths = match_storage_paths_by_name(
llm_suggestions.get("storage_paths", []),
request.user,
hinted_names=set(hints["storage_paths"]) if hints else None,
)
resp_data = {
@@ -3126,6 +3134,7 @@ class PostDocumentView(GenericAPIView[Any]):
serializer.is_valid(raise_exception=True)
doc_name, doc_data = serializer.validated_data.get("document")
doc_name = normalize("NFC", doc_name)
correspondent_id = serializer.validated_data.get("correspondent")
document_type_id = serializer.validated_data.get("document_type")
storage_path_id = serializer.validated_data.get("storage_path")
+2 -28
@@ -20,6 +20,7 @@ from PIL import Image
from PIL import ImageDraw
from PIL import ImageFont
from paperless.parsers.utils import read_file_handle_unicode_errors
from paperless.version import __full_version_str__
if TYPE_CHECKING:
@@ -183,7 +184,7 @@ class TextDocumentParser:
documents.parsers.ParseError
If the file cannot be read.
"""
self._text = self._read_text(document_path)
self._text = read_file_handle_unicode_errors(document_path, log=logger)
# ------------------------------------------------------------------
# Result accessors
@@ -295,30 +296,3 @@ class TextDocumentParser:
Always ``[]`` — plain text files carry no structured metadata.
"""
return []
# ------------------------------------------------------------------
# Private helpers
# ------------------------------------------------------------------
def _read_text(self, filepath: Path) -> str:
"""Read file content, replacing invalid UTF-8 bytes rather than failing.
Parameters
----------
filepath:
Path to the file to read.
Returns
-------
str
File content as a string.
"""
try:
return filepath.read_text(encoding="utf-8")
except UnicodeDecodeError as exc:
logger.warning(
"Unicode error reading %s, replacing bad bytes: %s",
filepath,
exc,
)
return filepath.read_bytes().decode("utf-8", errors="replace")
+18 -5
@@ -8,6 +8,7 @@ share implementation.
from __future__ import annotations
import codecs
import logging
import re
import tempfile
@@ -114,7 +115,7 @@ def read_file_handle_unicode_errors(
filepath: Path,
log: logging.Logger | None = None,
) -> str:
"""Read a file as UTF-8 text, replacing invalid bytes rather than raising.
"""Read a file as text, detecting encoding via BOM and stripping NUL bytes.
Parameters
----------
@@ -127,15 +128,27 @@ def read_file_handle_unicode_errors(
Returns
-------
str
File content as a string, with any invalid UTF-8 sequences replaced
by the Unicode replacement character.
File content as a string, with NUL bytes removed so the result is
safe to store in PostgreSQL text fields.
"""
_log = log or logger
raw = filepath.read_bytes()
if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
encoding = "utf-16"
elif raw.startswith(codecs.BOM_UTF8):
encoding = "utf-8-sig"
else:
encoding = "utf-8"
try:
return filepath.read_text(encoding="utf-8")
text = raw.decode(encoding)
except UnicodeDecodeError as e:
_log.warning("Unicode error during text reading, continuing: %s", e)
return filepath.read_bytes().decode("utf-8", errors="replace")
text = raw.decode("utf-8", errors="replace")
# PostgreSQL rejects NUL (0x00) bytes in text fields
return text.replace("\x00", "")
def get_page_count_for_pdf(
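Review note: the decode path introduced in this hunk can be exercised without touching the filesystem. The sketch below mirrors the branch order shown in the diff but works on raw bytes; `decode_text` is a hypothetical name, and logging is left out for brevity.

```python
import codecs


def decode_text(raw: bytes) -> str:
    """Sketch: pick a codec from the BOM, decode, then drop NUL characters."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and uses it to pick the byte order.
        encoding = "utf-16"
    elif raw.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" strips the UTF-8 BOM from the decoded text.
        encoding = "utf-8-sig"
    else:
        encoding = "utf-8"
    try:
        text = raw.decode(encoding)
    except UnicodeDecodeError:
        # Same fallback as the diff: decode as UTF-8, replacing bad bytes.
        text = raw.decode("utf-8", errors="replace")
    # NUL characters are removed so the text can be stored in a database
    # text column that rejects them.
    return text.replace("\x00", "")
```

One side effect worth noting: UTF-16 text without a BOM takes the plain UTF-8 branch, where ASCII characters decode with interleaved NULs that the final `replace` then removes.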
+7
@@ -98,6 +98,13 @@ MODEL_FILE = get_path_from_env(
)
LLM_INDEX_DIR = DATA_DIR / "llm_index"
LLM_INDEX_LOCK = LLM_INDEX_DIR / "index.lock"
# Cross-process read/write lock guarding the LLM index compaction/migration
# file swap. Readers hold it shared; the swap takes it exclusively so it never
# runs while a reader connection is open. Must be a SQLite (.db) file.
LLM_INDEX_RWLOCK = LLM_INDEX_DIR / "llmindex.rwlock.db"
# Seconds the compaction swap waits for active readers to drain before skipping
# this cycle (it is a maintenance operation; the next run retries).
LLM_INDEX_COMPACTION_LOCK_TIMEOUT = 30
LOGGING_DIR = get_path_from_env("PAPERLESS_LOGGING_DIR", DATA_DIR / "log")
+37
@@ -2,13 +2,50 @@
from __future__ import annotations
import codecs
from pathlib import Path
from paperless.parsers.utils import is_tagged_pdf
from paperless.parsers.utils import read_file_handle_unicode_errors
SAMPLES = Path(__file__).parent / "samples" / "tesseract"
class TestReadFileHandleUnicodeErrors:
def test_plain_utf8(self, tmp_path: Path) -> None:
f = tmp_path / "plain.txt"
f.write_bytes(b"hello world")
assert read_file_handle_unicode_errors(f) == "hello world"
def test_utf8_bom(self, tmp_path: Path) -> None:
f = tmp_path / "bom.txt"
f.write_bytes(codecs.BOM_UTF8 + b"hello")
assert read_file_handle_unicode_errors(f) == "hello"
def test_utf16_le(self, tmp_path: Path) -> None:
f = tmp_path / "utf16le.txt"
f.write_bytes(codecs.BOM_UTF16_LE + "hello".encode("utf-16-le"))
assert read_file_handle_unicode_errors(f) == "hello"
def test_utf16_be(self, tmp_path: Path) -> None:
f = tmp_path / "utf16be.txt"
f.write_bytes(codecs.BOM_UTF16_BE + "hello".encode("utf-16-be"))
assert read_file_handle_unicode_errors(f) == "hello"
def test_nul_bytes_stripped(self, tmp_path: Path) -> None:
f = tmp_path / "null-bytes.txt"
f.write_bytes(b"foo\x00bar")
assert read_file_handle_unicode_errors(f) == "foobar"
def test_invalid_utf8_replaced(self, tmp_path: Path) -> None:
f = tmp_path / "bad.txt"
f.write_bytes(b"ok\x80\x81bad")
result = read_file_handle_unicode_errors(f)
assert "ok" in result
assert "bad" in result
assert "\x00" not in result
class TestIsTaggedPdf:
def test_tagged_pdf_returns_true(self) -> None:
assert is_tagged_pdf(SAMPLES / "simple-digital.pdf") is True
+20 -19
@@ -1,16 +1,21 @@
import json
import logging
from typing import TYPE_CHECKING
from django.conf import settings
from django.contrib.auth.models import User
from documents.models import Document
from documents.permissions import get_objects_for_user_owner_aware
from paperless.config import AIConfig
from paperless_ai.client import AIClient
from paperless_ai.db import db_connection_released
from paperless_ai.indexing import query_similar_documents
from paperless_ai.indexing import truncate_content
from paperless_ai.indexing import visible_document_ids_for_user
from paperless_ai.taxonomy import format_hints_for_prompt
if TYPE_CHECKING:
from paperless_ai.taxonomy import TaxonomyHints
logger = logging.getLogger("paperless_ai.rag_classifier")
@@ -26,6 +31,7 @@ def get_language_name(language_code: str) -> str:
def build_prompt_without_rag(
document: Document,
config: AIConfig,
hints: "TaxonomyHints | None" = None,
) -> str:
filename = document.filename or ""
content = truncate_content(
@@ -34,10 +40,16 @@ def build_prompt_without_rag(
context_size=config.llm_context_size,
)
hints_block = format_hints_for_prompt(hints) if hints else ""
# Splice the block (if any) immediately before the "Analyze ..." instruction.
# When there is no block this expands to nothing, so the prompt is identical
# to the pre-hints baseline.
hints_section = f"{hints_block}\n\n " if hints_block else ""
return f"""
You are a document classification assistant.
Analyze the following document and extract the following information:
{hints_section}Analyze the following document and extract the following information:
- A short descriptive title
- Tags that reflect the content
- Names of people or organizations mentioned
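Review note: the splice in this hunk is meant to leave the prompt unchanged when there are no hints. A reduced sketch of that property follows. The prompt is shortened, whitespace is simplified, and the hint text is invented for illustration, since the output of `format_hints_for_prompt` is not shown in this diff.

```python
def build_prompt(hints_block: str = "") -> str:
    """Sketch: place an optional hints block before the 'Analyze' instruction."""
    # With an empty block this expands to nothing, so the baseline is preserved.
    hints_section = f"{hints_block}\n\n" if hints_block else ""
    return (
        "You are a document classification assistant.\n\n"
        f"{hints_section}"
        "Analyze the following document and extract the following information:\n"
        "- A short descriptive title\n"
    )


baseline = build_prompt()
# Hypothetical hint text; the real block format may differ.
hinted = build_prompt("Existing tags seen on similar documents: invoice, tax")
```

Keeping the empty case identical to the old prompt means existing prompt assertions in tests keep passing when no hints are available.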
@@ -57,8 +69,9 @@ def build_prompt_with_rag(
document: Document,
config: AIConfig,
user: User | None = None,
hints: "TaxonomyHints | None" = None,
) -> str:
base_prompt = build_prompt_without_rag(document, config)
base_prompt = build_prompt_without_rag(document, config, hints=hints)
context = truncate_content(
get_context_for_document(document, user),
chunk_size=config.llm_embedding_chunk_size,
@@ -96,20 +109,7 @@ def get_context_for_document(
user: User | None = None,
max_docs: int = 5,
) -> str:
visible_documents = (
get_objects_for_user_owner_aware(
user,
"view_document",
Document,
)
if user
else None
)
visible_document_ids = (
list(visible_documents.values_list("pk", flat=True))
if visible_documents is not None
else None
)
visible_document_ids = visible_document_ids_for_user(user)
similar_docs = query_similar_documents(
document=doc,
document_ids=visible_document_ids,
@@ -137,13 +137,14 @@ def get_ai_document_classification(
document: Document,
user: User | None = None,
output_language: str | None = None,
hints: "TaxonomyHints | None" = None,
) -> dict:
ai_config = AIConfig()
prompt = (
build_prompt_with_rag(document, ai_config, user)
build_prompt_with_rag(document, ai_config, user, hints=hints)
if ai_config.llm_embedding_backend
else build_prompt_without_rag(document, ai_config)
else build_prompt_without_rag(document, ai_config, hints=hints)
)
client = AIClient()
+49 -42
@@ -9,6 +9,7 @@ from paperless_ai.db import db_connection_released
from paperless_ai.indexing import _document_id_filters
from paperless_ai.indexing import get_rag_prompt_helper
from paperless_ai.indexing import load_or_build_index
from paperless_ai.indexing import read_store
logger = logging.getLogger("paperless_ai.chat")
@@ -97,53 +98,59 @@ def _stream_chat_with_documents(query_str: str, documents: list[Document]):
from llama_index.core.retrievers import VectorIndexRetriever
config = AIConfig()
index = load_or_build_index(config)
filters = _document_id_filters(str(doc.pk) for doc in documents)
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=CHAT_RETRIEVER_TOP_K,
filters=filters,
)
# Hold the shared read lock for the whole operation: the query engine
# retrieves from the vector store again during synthesis, so the connection
# must stay open (and the swap must not run) until the stream finishes.
with read_store() as store:
index = load_or_build_index(config, store)
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=CHAT_RETRIEVER_TOP_K,
filters=filters,
)
# Slow query-embedding + vector search; no Django ORM access happens during
# it, so release the pooled DB connection for its duration. See #12976.
with db_connection_released():
top_nodes = retriever.retrieve(query_str)
if not top_nodes:
logger.warning("No nodes found for the given documents.")
yield CHAT_NO_CONTENT_MESSAGE
return
# Slow query-embedding + vector search; no Django ORM access happens
# during it, so release the pooled DB connection for its duration. See
# #12976.
with db_connection_released():
top_nodes = retriever.retrieve(query_str)
if not top_nodes:
logger.warning("No nodes found for the given documents.")
yield CHAT_NO_CONTENT_MESSAGE
return
client = AIClient()
client = AIClient()
references = _get_document_references(documents, top_nodes)
references = _get_document_references(documents, top_nodes)
prompt_template = PromptTemplate(template=CHAT_PROMPT_TMPL)
response_synthesizer = get_response_synthesizer(
llm=client.llm,
prompt_helper=get_rag_prompt_helper(
chunk_size=config.llm_embedding_chunk_size,
context_size=config.llm_context_size,
),
text_qa_template=prompt_template,
streaming=True,
)
query_engine = RetrieverQueryEngine.from_args(
retriever=retriever,
llm=client.llm,
response_synthesizer=response_synthesizer,
streaming=True,
)
prompt_template = PromptTemplate(template=CHAT_PROMPT_TMPL)
response_synthesizer = get_response_synthesizer(
llm=client.llm,
prompt_helper=get_rag_prompt_helper(
chunk_size=config.llm_embedding_chunk_size,
context_size=config.llm_context_size,
),
text_qa_template=prompt_template,
streaming=True,
)
query_engine = RetrieverQueryEngine.from_args(
retriever=retriever,
llm=client.llm,
response_synthesizer=response_synthesizer,
streaming=True,
)
logger.debug("Document chat query: %s", query_str)
# Release the pooled DB connection for the slow streaming LLM response so it
# is not pinned for the whole stream; see paperless_ai.db and #12976.
with db_connection_released():
response_stream = query_engine.query(query_str)
for chunk in response_stream.response_gen:
yield chunk
sys.stdout.flush()
logger.debug("Document chat query: %s", query_str)
# Release the pooled DB connection for the slow streaming LLM response
# so it is not pinned for the whole stream; see paperless_ai.db and
# #12976.
with db_connection_released():
response_stream = query_engine.query(query_str)
for chunk in response_stream.response_gen:
yield chunk
sys.stdout.flush()
if references:
yield _format_chat_metadata_trailer(references)
if references:
yield _format_chat_metadata_trailer(references)
+11 -11
@@ -99,9 +99,13 @@ _DEFAULT_MODEL_NAMES = {
def get_configured_model_name(config: AIConfig) -> str:
"""Return the canonical name of the currently configured embedding model."""
default = _DEFAULT_MODEL_NAMES.get(
config.llm_embedding_backend,
"sentence-transformers/all-MiniLM-L6-v2",
# dict.get(key, default) overload resolution fails for TextChoices keys in some
# type checkers; use `or` fallback to avoid the ambiguity.
default = (
_DEFAULT_MODEL_NAMES.get(
config.llm_embedding_backend,
)
or "sentence-transformers/all-MiniLM-L6-v2"
)
return config.llm_embedding_model or default
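Review note: switching from `dict.get(key, default)` to `dict.get(key) or default` is not purely a typing change. The two forms differ when the key is present with a falsy value. The sketch below uses a made-up mapping (`defaults`, `backend-a`, `backend-b` are hypothetical) to show where they diverge.

```python
FALLBACK = "sentence-transformers/all-MiniLM-L6-v2"

# Hypothetical mapping; "backend-b" deliberately maps to an empty string.
defaults = {"backend-a": "model-a", "backend-b": ""}


def with_default_arg(key):
    # The fallback is used only when the key is missing.
    return defaults.get(key, FALLBACK)


def with_or(key):
    # The fallback is used when the key is missing OR its value is falsy.
    return defaults.get(key) or FALLBACK
```

For a mapping whose values are all non-empty strings, the two forms return the same result for every key.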
@@ -112,15 +116,11 @@ def _normalize_llm_index_text(text: str) -> str:
def build_llm_index_text(doc: Document) -> str:
# TODO: Filename, Storage Path, and Archive Serial Number are short structured
# values that could move to node.metadata (excluded from embeddings, visible to
# LLM via metadata prepend) — same pattern as title/tags/correspondent. Notes
# and Custom Fields should stay here: Notes can be long free text, Custom Fields
# are dynamic in count and best kept in the embedding.
# Short structured fields (filename, storage path, ASN, title, tags, ...) live
# in node.metadata: excluded from embeddings, shown to the LLM via metadata
# prepend. Notes and Custom Fields stay in the body: Notes can be long free
# text, Custom Fields are dynamic in count and best kept in the embedding.
lines = [
f"Filename: {doc.filename}",
f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
f"Archive Serial Number: {doc.archive_serial_number or ''}",
f"Notes: {','.join([str(c.note) for c in Note.objects.filter(document=doc)])}",
]
+194 -49
@@ -5,11 +5,15 @@ from datetime import timedelta
from typing import TYPE_CHECKING
from django.conf import settings
from django.contrib.auth.models import User
from django.utils import timezone
from filelock import FileLock
from filelock import ReadWriteLock
from filelock import Timeout
from documents.models import Document
from documents.models import PaperlessTask
from documents.permissions import get_objects_for_user_owner_aware
from documents.utils import IterWrapper
from documents.utils import identity
from paperless.config import AIConfig
@@ -20,14 +24,13 @@ from paperless_ai.embedding import get_embedding_model
if TYPE_CHECKING:
from llama_index.core.schema import BaseNode
from llama_index.core.schema import NodeWithScore
from paperless_ai.vector_store import PaperlessLanceVectorStore
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
logger = logging.getLogger("paperless_ai.indexing")
LLM_INDEX_TABLE = "documents"
RAG_NUM_OUTPUT = 512
RAG_CHUNK_OVERLAP = 200
@@ -63,36 +66,108 @@ def queue_llm_index_update_if_needed(*, rebuild: bool, reason: str) -> bool:
return True
def get_vector_store() -> "PaperlessLanceVectorStore":
from paperless_ai.vector_store import PaperlessLanceVectorStore
def get_vector_store() -> "PaperlessSqliteVecVectorStore":
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
settings.LLM_INDEX_DIR.mkdir(parents=True, exist_ok=True)
return PaperlessLanceVectorStore(
return PaperlessSqliteVecVectorStore(
uri=str(settings.LLM_INDEX_DIR),
table_name=LLM_INDEX_TABLE,
)
# --- LLM index locking ---------------------------------------------------
#
# Two locks guard the index; they answer different questions and are NOT
# interchangeable:
#
# * settings.LLM_INDEX_LOCK (FileLock, exclusive) -- serializes WRITERS against
# each other, so only one rebuild/upsert/delete/compaction runs at a time.
# Taken by write_store(). Readers never take it, so it never blocks reads.
#
# * settings.LLM_INDEX_RWLOCK (ReadWriteLock) -- coordinates readers against the
# compaction/migration file swap. read_store() takes it SHARED (readers run
# concurrently); _exclude_readers() takes it EXCLUSIVE, only for the swap, so
# the database file is never replaced while a reader connection is open (that
# would alias the old WAL onto the new file and corrupt it).
#
# | vs another writer | vs a reader
# -----------------+-------------------+----------------------------
# normal write | LLM_INDEX_LOCK | nothing (WAL gives MVCC)
# compaction/swap | LLM_INDEX_LOCK | LLM_INDEX_RWLOCK (exclusive)
# reader | nothing (WAL) | LLM_INDEX_RWLOCK (shared)
#
# They can't be merged into one ReadWriteLock: a normal write must exclude other
# writers WITHOUT blocking readers (WAL already gives reader/writer concurrency),
# and ReadWriteLock has no "exclusive vs writers, shared vs readers" mode. Only
# the swap needs to exclude readers.
def _index_rwlock() -> ReadWriteLock:
"""Return a fresh read/write lock instance for the index swap.
``is_singleton=False`` so reads and the swap always coordinate through
SQLite (the actual cross-process case) rather than hitting the in-process
reentrant-upgrade guard; callers must ``close()`` it (the context managers
below do).
"""
settings.LLM_INDEX_DIR.mkdir(parents=True, exist_ok=True)
return ReadWriteLock(str(settings.LLM_INDEX_RWLOCK), is_singleton=False)
@contextmanager
def read_store():
"""Acquire the shared read lock and yield the vector store for a read.
The shared lock is held for the whole lifetime of the connection (and
closed on exit) so the compaction/migration swap, which takes the exclusive
lock, never runs while this connection is open. Concurrent readers do not
block each other; only the swap does.
"""
lock = _index_rwlock()
try:
with lock.read_lock(), get_vector_store() as store:
yield store
finally:
lock.close()
@contextmanager
def _exclude_readers():
"""Acquire exclusive index access, blocking until readers have drained.
The exclusive counterpart to ``read_store()``: a compaction or migration
must not run while any reader connection is open. Raises
:class:`filelock.Timeout` if active readers do not drain within
``LLM_INDEX_COMPACTION_LOCK_TIMEOUT``; callers skip the operation on timeout.
"""
lock = _index_rwlock()
try:
with lock.write_lock(timeout=settings.LLM_INDEX_COMPACTION_LOCK_TIMEOUT):
yield
finally:
lock.close()
@contextmanager
def write_store(embed_model_name: str | None = None):
"""Acquire the write lock and yield the vector store.
All mutating operations (upsert, delete, rebuild, compact) must go through
this context manager to serialise concurrent Celery writers.
Read paths use ``get_vector_store()`` directly — no lock needed.
Read paths use ``read_store()`` so they hold the shared read lock.
Pass ``embed_model_name`` whenever the operation may create the table so
the model name is recorded in the schema metadata for future mismatch checks.
"""
from paperless_ai.vector_store import PaperlessLanceVectorStore
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
settings.LLM_INDEX_DIR.mkdir(parents=True, exist_ok=True)
with FileLock(settings.LLM_INDEX_LOCK):
yield PaperlessLanceVectorStore(
with (
FileLock(settings.LLM_INDEX_LOCK),
PaperlessSqliteVecVectorStore(
uri=str(settings.LLM_INDEX_DIR),
table_name=LLM_INDEX_TABLE,
embed_model_name=embed_model_name,
)
) as store,
):
yield store
def build_document_node(
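Review note: the "shared for readers, exclusive with timeout for the swap, skip on timeout" pattern from this hunk can be tried in isolation. The sketch below is an in-process stand-in built on `threading.Condition`. It is not the `filelock.ReadWriteLock` used in the diff and gives no cross-process guarantees. `SketchRWLock` and `compact_if_idle` are hypothetical names.

```python
import threading
from contextlib import contextmanager


class SketchRWLock:
    """In-process illustration of a read/write lock (not filelock's class)."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    @contextmanager
    def read_lock(self):
        # Readers wait only for an active writer, never for each other.
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1
        try:
            yield
        finally:
            with self._cond:
                self._readers -= 1
                self._cond.notify_all()

    @contextmanager
    def write_lock(self, timeout: float):
        # The writer needs zero readers and no other writer, within the timeout.
        with self._cond:
            idle = self._cond.wait_for(
                lambda: not self._writer and self._readers == 0,
                timeout=timeout,
            )
            if not idle:
                raise TimeoutError("readers did not drain in time")
            self._writer = True
        try:
            yield
        finally:
            with self._cond:
                self._writer = False
                self._cond.notify_all()


def compact_if_idle(lock: SketchRWLock, timeout: float = 0.05) -> str:
    """Run the maintenance step only if exclusive access arrives in time."""
    try:
        with lock.write_lock(timeout=timeout):
            return "compacted"
    except TimeoutError:
        return "skipped"
```

Treating the timeout as "skip this cycle" rather than an error fits a maintenance operation: a skipped compaction costs nothing, and the next run retries.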
@@ -114,6 +189,9 @@ def build_document_node(
"document_type": document.document_type.name
if document.document_type
else None,
"filename": document.filename,
"storage_path": document.storage_path.name if document.storage_path else None,
"archive_serial_number": document.archive_serial_number,
"created": document.created.isoformat() if document.created else None,
"added": document.added.isoformat() if document.added else None,
"modified": document.modified.isoformat(),
@@ -140,23 +218,27 @@ def build_document_node(
return parser.get_nodes_from_documents([doc])
def load_or_build_index(config: AIConfig):
"""Return a VectorStoreIndex backed by the vector store."""
def load_or_build_index(config: AIConfig, store: "PaperlessSqliteVecVectorStore"):
"""Return a VectorStoreIndex backed by ``store``.
``store`` is supplied by the caller's ``read_store()`` context so the shared
read lock and the connection stay alive for the whole retrieval.
"""
import llama_index.core.settings as llama_settings
from llama_index.core import VectorStoreIndex
embed_model = get_embedding_model(config)
llama_settings.Settings.embed_model = embed_model
vector_store = get_vector_store()
return VectorStoreIndex.from_vector_store(
vector_store=vector_store,
vector_store=store,
embed_model=embed_model,
)
def llm_index_exists() -> bool:
"""True when the index table exists on disk."""
return get_vector_store().table_exists()
with read_store() as store:
return store.table_exists()
def get_rag_chunk_size() -> int:
@@ -224,6 +306,21 @@ def update_llm_index(
rebuild=False,
) -> str:
"""Rebuild or incrementally update the LLM index."""
with write_store() as store:
try:
with _exclude_readers():
needs_reembed = store.check_and_run_migrations()
except Timeout:
logger.info(
"Skipping LLM index migration check: index readers are active; "
"will retry next run.",
)
needs_reembed = False
if needs_reembed:
logger.warning(
"LLM index migration requires re-embedding; forcing rebuild.",
)
rebuild = True
documents = Document.objects.all()
no_documents = not documents.exists()
@@ -235,13 +332,12 @@ def update_llm_index(
config = AIConfig()
model_name = get_configured_model_name(config)
if (
not rebuild
and llm_index_exists()
and get_vector_store().config_mismatch(model_name)
):
logger.warning("Embedding model changed; forcing LLM index rebuild.")
rebuild = True
if not rebuild and llm_index_exists():
with read_store() as store:
config_mismatch = store.config_mismatch(model_name)
if config_mismatch:
logger.warning("Embedding model changed; forcing LLM index rebuild.")
rebuild = True
if no_documents:
logger.warning("No documents found to index.")
@@ -251,7 +347,6 @@ def update_llm_index(
with write_store(embed_model_name=model_name) as store:
if rebuild or not store.table_exists():
(settings.LLM_INDEX_DIR / "meta.json").unlink(missing_ok=True)
logger.info("Rebuilding LLM index.")
store.drop_table()
for document in iter_wrapper(documents):
@@ -276,9 +371,14 @@ def update_llm_index(
else "No changes detected in LLM index."
)
store.ensure_document_id_scalar_index()
store.maybe_create_ann_index()
store.compact(retention_seconds=60 * 60) # 1 hour: safe for in-flight readers
try:
with _exclude_readers():
store.compact()
except Timeout:
logger.info(
"Skipping LLM index compaction: index readers are active; "
"will retry next run.",
)
return msg
@@ -294,13 +394,19 @@ def llm_index_add_or_update_document(document: Document):
with write_store(embed_model_name=get_configured_model_name(config)) as store:
store.upsert_document(str(document.id), new_nodes)
store.ensure_document_id_scalar_index()
def llm_index_compact() -> None:
"""Compact the index immediately, clearing all MVCC version history."""
"""Compact the index immediately, rebuilding the table to reclaim space."""
with write_store() as store:
store.compact(retention_seconds=0)
try:
with _exclude_readers():
store.compact(force=True)
except Timeout:
logger.info(
"Skipping LLM index compaction: index readers are active; "
"will retry next run.",
)
def llm_index_remove_document(document: Document):
@@ -346,12 +452,36 @@ def normalize_document_ids(document_ids: Iterable[int | str] | None) -> set[str]
return {str(document_id) for document_id in document_ids}
def query_similar_documents(
def visible_document_ids_for_user(user: User | None) -> list[int] | None:
"""Return the pks of documents ``user`` may view, or ``None`` for no filter.
Returns ``None`` when ``user`` is ``None`` so retrieval runs unfiltered. Used
by both the similarity-context and taxonomy-hints paths to scope RAG
neighbours to documents the requesting user is allowed to see.
"""
if user is None:
return None
visible_documents = get_objects_for_user_owner_aware(
user,
"view_document",
Document,
)
return list(visible_documents.values_list("pk", flat=True))
def retrieve_similar_nodes(
document: Document,
top_k: int = 5,
document_ids: Iterable[int | str] | None = None,
) -> list[Document]:
"""Return up to ``top_k`` Documents most similar to ``document``."""
top_k: int = 5,
) -> list["NodeWithScore"]:
"""Run ANN retrieval and return the raw NodeWithScore results.
Returns ``[]`` when the allow-list normalizes to empty, or when no index
exists yet (queuing a build in that case). The ``retrieve()`` call is a slow
embedding request, so it runs inside ``db_connection_released()`` to avoid
pinning the pooled DB connection (#12976). Both ``query_similar_documents``
and the taxonomy-hints path go through here, so they share that behavior.
"""
allowed_document_ids = normalize_document_ids(document_ids)
if allowed_document_ids is not None and not allowed_document_ids:
return []
@@ -367,30 +497,45 @@ def query_similar_documents(
from llama_index.core.retrievers import VectorIndexRetriever
index = load_or_build_index(config)
filters = (
_document_id_filters(allowed_document_ids)
if allowed_document_ids is not None
else None
)
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=top_k,
filters=filters,
)
query_text = truncate_content(
(document.title or "") + "\n" + (document.content or ""),
chunk_size=config.llm_embedding_chunk_size,
context_size=config.llm_context_size,
)
# The retrieve() call generates a query embedding (a slow external request)
# and searches the vector store; no Django ORM access happens during it, so
# release the pooled DB connection for its duration. See #12976.
with db_connection_released():
results = retriever.retrieve(query_text)
# Hold the shared read lock for the whole retrieval so the connection is
# never open across a compaction swap. The retrieve() call generates a
# query embedding (a slow external request) and searches the vector store;
# no Django ORM access happens during it, so release the pooled DB
# connection for its duration. See #12976.
with read_store() as store:
index = load_or_build_index(config, store)
retriever = VectorIndexRetriever(
index=index,
similarity_top_k=top_k,
filters=filters,
)
with db_connection_released():
return retriever.retrieve(query_text)
def query_similar_documents(
document: Document,
top_k: int = 5,
document_ids: Iterable[int | str] | None = None,
) -> list[Document]:
"""Return up to ``top_k`` Documents most similar to ``document``."""
allowed_document_ids = normalize_document_ids(document_ids)
results = retrieve_similar_nodes(
document=document,
document_ids=allowed_document_ids,
top_k=top_k,
)
retrieved_document_ids: list[int] = []
for node in results:
+38 -11
@@ -15,40 +15,56 @@ MATCH_THRESHOLD = 0.8
logger = logging.getLogger("paperless_ai.matching")
def match_tags_by_name(names: list[str], user: User) -> list[Tag]:
def match_tags_by_name(
names: list[str],
user: User,
hinted_names: set[str] | None = None,
) -> list[Tag]:
queryset = get_objects_for_user_owner_aware(
user,
["view_tag"],
Tag,
)
return _match_names_to_queryset(names, queryset, "name")
return _match_names_to_queryset(names, queryset, "name", hinted_names)
def match_correspondents_by_name(names: list[str], user: User) -> list[Correspondent]:
def match_correspondents_by_name(
names: list[str],
user: User,
hinted_names: set[str] | None = None,
) -> list[Correspondent]:
queryset = get_objects_for_user_owner_aware(
user,
["view_correspondent"],
Correspondent,
)
return _match_names_to_queryset(names, queryset, "name")
return _match_names_to_queryset(names, queryset, "name", hinted_names)
def match_document_types_by_name(names: list[str], user: User) -> list[DocumentType]:
def match_document_types_by_name(
names: list[str],
user: User,
hinted_names: set[str] | None = None,
) -> list[DocumentType]:
queryset = get_objects_for_user_owner_aware(
user,
["view_documenttype"],
DocumentType,
)
return _match_names_to_queryset(names, queryset, "name")
return _match_names_to_queryset(names, queryset, "name", hinted_names)
def match_storage_paths_by_name(names: list[str], user: User) -> list[StoragePath]:
def match_storage_paths_by_name(
names: list[str],
user: User,
hinted_names: set[str] | None = None,
) -> list[StoragePath]:
queryset = get_objects_for_user_owner_aware(
user,
["view_storagepath"],
StoragePath,
)
return _match_names_to_queryset(names, queryset, "name")
return _match_names_to_queryset(names, queryset, "name", hinted_names)
def _normalize(s: str) -> str:
@@ -58,10 +74,18 @@ def _normalize(s: str) -> str:
return s
def _match_names_to_queryset(names: list[str], queryset, attr: str):
def _match_names_to_queryset(
names: list[str],
queryset,
attr: str,
hinted_names: set[str] | None = None,
):
results = []
objects = list(queryset)
object_names = [_normalize(getattr(obj, attr)) for obj in objects]
normalized_hints = (
{_normalize(name) for name in hinted_names} if hinted_names else set()
)
for name in names:
if not name:
@@ -76,6 +100,11 @@ def _match_names_to_queryset(names: list[str], queryset, attr: str):
results.append(matched)
continue
# A hinted name that didn't exact-match came from existing taxonomy
# verbatim; do not fuzzy-map it onto a different object.
if target in normalized_hints:
continue
# Fuzzy match fallback
matches = difflib.get_close_matches(
target,
@@ -88,8 +117,6 @@ def _match_names_to_queryset(names: list[str], queryset, attr: str):
matched = objects.pop(index)
object_names.pop(index)
results.append(matched)
else:
pass
return results
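Review note: the ordering this hunk introduces (exact match, then the hint gate, then the fuzzy fallback) can be sketched on plain strings. This is a minimal sketch, not the project's code: `normalize` is a hypothetical stand-in for `_normalize`, whose body is not shown in the diff, and `n=1` with `cutoff=MATCH_THRESHOLD` are assumed arguments for the truncated `difflib.get_close_matches` call.

```python
import difflib

MATCH_THRESHOLD = 0.8  # value visible in the hunk header above


def normalize(s: str) -> str:
    # Hypothetical stand-in for _normalize; the real body is not in the diff.
    return s.strip().lower()


def match_names(names, candidates, hinted_names=None):
    """Exact match first, then skip hinted names, then fuzzy-match the rest."""
    remaining = list(candidates)
    remaining_norm = [normalize(c) for c in remaining]
    hints = {normalize(h) for h in hinted_names} if hinted_names else set()
    results = []
    for name in names:
        if not name:
            continue
        target = normalize(name)
        if target in remaining_norm:
            index = remaining_norm.index(target)
            results.append(remaining.pop(index))
            remaining_norm.pop(index)
            continue
        if target in hints:
            # Hinted but no exact object: never fuzzy-map it onto another one.
            continue
        close = difflib.get_close_matches(
            target,
            remaining_norm,
            n=1,  # assumed; the real arguments are truncated in the diff
            cutoff=MATCH_THRESHOLD,
        )
        if close:
            index = remaining_norm.index(close[0])
            results.append(remaining.pop(index))
            remaining_norm.pop(index)
    return results


typo_unhinted = match_names(["Bloodwrok"], ["Bloodwork"], {"Taxes"})
typo_hinted = match_names(["Bloodwrok"], ["Bloodwork"], {"Bloodwrok"})
padded_hinted = match_names(["Bloodwork "], ["Bloodwork"], {"Bloodwork"})
```

Under these assumptions an unhinted typo still resolves through the fuzzy fallback, while the same typo passed as a hint is dropped. That is the behaviour the `TestHintedMatching` cases later in this diff assert.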
+115
@@ -0,0 +1,115 @@
from typing import TYPE_CHECKING
from typing import TypedDict
from django.contrib.auth.models import User
from documents.models import Document
from paperless.config import AIConfig
from paperless_ai.indexing import retrieve_similar_nodes
from paperless_ai.indexing import visible_document_ids_for_user
if TYPE_CHECKING:
from llama_index.core.schema import NodeWithScore
class TaxonomyHints(TypedDict):
tags: list[str]
document_types: list[str]
correspondents: list[str]
storage_paths: list[str]
def build_taxonomy_hints_from_nodes(
nodes: list["NodeWithScore"],
) -> TaxonomyHints:
"""Collect the unique, sorted taxonomy names carried on retrieved nodes.
Reads ``tags`` (a list), ``document_type``, ``correspondent``, and
``storage_path`` from each node's metadata. Empty / ``None`` values and
missing keys are skipped. The result is naturally bounded by the retrieval
``top_k``, so no cap is applied.
"""
tags: set[str] = set()
document_types: set[str] = set()
correspondents: set[str] = set()
storage_paths: set[str] = set()
for node in nodes:
metadata = node.metadata or {}
for tag in metadata.get("tags") or []:
if tag:
tags.add(tag)
document_type = metadata.get("document_type")
if document_type:
document_types.add(document_type)
correspondent = metadata.get("correspondent")
if correspondent:
correspondents.add(correspondent)
storage_path = metadata.get("storage_path")
if storage_path:
storage_paths.add(storage_path)
return TaxonomyHints(
tags=sorted(tags),
document_types=sorted(document_types),
correspondents=sorted(correspondents),
storage_paths=sorted(storage_paths),
)
_HINT_INSTRUCTION = (
"Prefer existing names from these lists verbatim. Only propose a new value "
"if none of the existing names fits."
)
def format_hints_for_prompt(hints: TaxonomyHints) -> str:
"""Render non-empty hint categories as labelled blocks plus one instruction.
Returns "" when every category is empty, so callers can treat the result
the same as no hints at all.
"""
# Literal-key access keeps this TypedDict-safe for mypy; the order here is
# the order the blocks appear in the prompt.
labelled_values: list[tuple[str, list[str]]] = [
("Available tags", hints["tags"]),
("Available document types", hints["document_types"]),
("Available correspondents", hints["correspondents"]),
("Available storage paths", hints["storage_paths"]),
]
blocks: list[str] = []
for label, values in labelled_values:
if values:
listing = "\n".join(f"- {value}" for value in values)
blocks.append(f"{label}:\n{listing}")
if not blocks:
return ""
return "\n\n".join([*blocks, _HINT_INSTRUCTION])
def get_taxonomy_hints_for_document(
document: Document,
user: User | None,
) -> TaxonomyHints | None:
"""Build taxonomy hints from a document's RAG neighbours.
Returns ``None`` when no embedding backend is configured (the gate) so the
caller's prompt and matching are identical to today. Otherwise returns a
``TaxonomyHints`` -- possibly all-empty when no similar documents exist.
Applies the same owner-aware visible-document filter as
``get_context_for_document``.
"""
if not AIConfig().llm_embedding_backend:
return None
nodes = retrieve_similar_nodes(
document=document,
document_ids=visible_document_ids_for_user(user),
)
return build_taxonomy_hints_from_nodes(nodes)
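Review note: to see what the new module hands to the prompt, the collect-then-render flow can be condensed to two categories. This is a rough sketch under stated assumptions: `collect` and `render` are hypothetical condensed versions of `build_taxonomy_hints_from_nodes` and `format_hints_for_prompt`, and the `SimpleNamespace` nodes stand in for `NodeWithScore` in the same way the tests in this PR do.

```python
from types import SimpleNamespace

INSTRUCTION = (
    "Prefer existing names from these lists verbatim. Only propose a new value "
    "if none of the existing names fits."
)


def collect(nodes):
    # Condensed, hypothetical version covering only tags and document types.
    tags, doc_types = set(), set()
    for node in nodes:
        meta = node.metadata or {}
        tags.update(t for t in (meta.get("tags") or []) if t)
        if meta.get("document_type"):
            doc_types.add(meta["document_type"])
    return {"tags": sorted(tags), "document_types": sorted(doc_types)}


def render(hints):
    # One labelled block per non-empty category, then a single instruction.
    labels = [
        ("Available tags", "tags"),
        ("Available document types", "document_types"),
    ]
    blocks = [
        f"{label}:\n" + "\n".join(f"- {value}" for value in hints[key])
        for label, key in labels
        if hints[key]
    ]
    return "\n\n".join([*blocks, INSTRUCTION]) if blocks else ""


nodes = [
    SimpleNamespace(metadata={"tags": ["Taxes", "Bloodwork"], "document_type": "Invoice"}),
    SimpleNamespace(metadata={"tags": ["Taxes", None]}),
]
prompt_block = render(collect(nodes))
empty_block = render(collect([]))
```

Values are deduplicated and sorted before rendering, so the block is stable across calls. An all-empty result renders as an empty string, which lets a caller treat it the same as having no hints.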
+1
@@ -10,6 +10,7 @@ from pytest_django.fixtures import SettingsWrapper
def temp_llm_index_dir(tmp_path: Path, settings: SettingsWrapper) -> Path:
settings.LLM_INDEX_DIR = tmp_path
settings.LLM_INDEX_LOCK = tmp_path / "index.lock"
settings.LLM_INDEX_RWLOCK = tmp_path / "llmindex.rwlock.db"
return tmp_path
@@ -1,8 +1,11 @@
import json
from types import SimpleNamespace
from typing import cast
from unittest.mock import MagicMock
from unittest.mock import patch
import pytest
import pytest_mock
from django.test import override_settings
from documents.models import Document
@@ -261,3 +264,111 @@ def test_get_context_for_document_no_similar_docs(mock_document):
with patch("paperless_ai.ai_classifier.query_similar_documents", return_value=[]):
result = get_context_for_document(mock_document)
assert result == ""
class TestPromptHints:
@pytest.fixture
def config(self) -> AIConfig:
# build_prompt_* only read these two numeric settings off config;
# a stand-in avoids constructing a DB-backed AIConfig.
return cast(
"AIConfig",
SimpleNamespace(llm_embedding_chunk_size=1000, llm_context_size=8000),
)
def test_without_rag_includes_hints_block(
self,
mock_document: MagicMock,
config: AIConfig,
) -> None:
hints = {
"tags": ["Bloodwork"],
"document_types": ["Invoice"],
"correspondents": [],
"storage_paths": [],
}
prompt = build_prompt_without_rag(mock_document, config, hints=hints)
assert "Available tags:" in prompt
assert "- Bloodwork" in prompt
assert "Prefer existing names from these lists verbatim" in prompt
def test_without_rag_none_matches_baseline(
self,
mock_document: MagicMock,
config: AIConfig,
) -> None:
baseline = build_prompt_without_rag(mock_document, config)
with_none = build_prompt_without_rag(mock_document, config, hints=None)
assert with_none == baseline
assert "Available tags:" not in with_none
def test_with_rag_includes_context_and_hints(
self,
mock_document: MagicMock,
config: AIConfig,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.ai_classifier.get_context_for_document",
return_value="TITLE: Neighbour\nsome context",
)
hints = {
"tags": ["Bloodwork"],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
prompt = build_prompt_with_rag(mock_document, config, user=None, hints=hints)
assert "Additional context from similar documents" in prompt
assert "Available tags:" in prompt
def test_classification_forwards_hints(
self,
mock_document: MagicMock,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.ai_classifier.AIConfig",
return_value=SimpleNamespace(
llm_embedding_backend=None,
llm_embedding_chunk_size=1000,
llm_context_size=8000,
),
)
build = mocker.patch(
"paperless_ai.ai_classifier.build_prompt_without_rag",
return_value="PROMPT",
)
mock_client = MagicMock()
mock_client.run_llm_query.return_value = {
"title": "t",
"tags": [],
"correspondents": [],
"document_types": [],
"storage_paths": [],
"dates": [],
}
mocker.patch("paperless_ai.ai_classifier.AIClient", return_value=mock_client)
hints = {
"tags": ["Bloodwork"],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
result = get_ai_document_classification(
mock_document,
user=None,
hints=hints,
)
_, build_kwargs = build.call_args
assert build_kwargs["hints"] == hints
assert set(result.keys()) == {
"title",
"tags",
"correspondents",
"document_types",
"storage_paths",
"dates",
}
+113 -71
@@ -1,5 +1,5 @@
import json
from pathlib import Path
from types import SimpleNamespace
from unittest.mock import MagicMock
from unittest.mock import patch
@@ -7,6 +7,7 @@ import pytest
import pytest_mock
from django.test import override_settings
from django.utils import timezone
from llama_index.core.schema import MetadataMode
from documents.models import Document
from documents.models import PaperlessTask
@@ -17,6 +18,7 @@ from documents.tests.factories import PaperlessTaskFactory
from paperless.models import ApplicationConfiguration
from paperless_ai import indexing
from paperless_ai.tests.conftest import FakeEmbedding
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
@pytest.fixture
@@ -33,12 +35,22 @@ def test_build_document_node(real_document: Document) -> None:
nodes = indexing.build_document_node(real_document)
assert len(nodes) > 0
assert nodes[0].metadata["document_id"] == str(real_document.id)
assert nodes[0].metadata["filename"] == real_document.filename
assert nodes[0].metadata["storage_path"] == (
real_document.storage_path.name if real_document.storage_path else None
)
assert (
nodes[0].metadata["archive_serial_number"]
== real_document.archive_serial_number
)
assert "filename" in nodes[0].excluded_embed_metadata_keys
assert "filename" not in nodes[0].excluded_llm_metadata_keys
@pytest.mark.django_db
def test_build_document_node_sets_ref_doc_id(real_document: Document) -> None:
"""Every node produced by build_document_node must carry the paperless document id
as its ref_doc_id so that the LanceDB adapter's delete(str(doc.id)) works correctly."""
as its ref_doc_id so that the vector store's delete(str(doc.id)) works correctly."""
nodes = indexing.build_document_node(real_document)
assert len(nodes) > 0, "Expected at least one node"
for node in nodes:
@@ -58,8 +70,6 @@ def test_build_document_node_excludes_metadata_from_embedding(
double the token count and exceed embedding models with small context
windows (e.g. nomic-embed-text via Ollama defaults to num_ctx=2048).
"""
from llama_index.core.schema import MetadataMode
nodes = indexing.build_document_node(real_document)
for node in nodes:
embed_text = node.get_content(metadata_mode=MetadataMode.EMBED)
@@ -91,8 +101,6 @@ def test_build_document_node_excludes_document_id_from_llm_context(
real_document: Document,
) -> None:
"""document_id is an internal key and must not appear in LLM context text."""
from llama_index.core.schema import MetadataMode
nodes = indexing.build_document_node(real_document)
assert len(nodes) > 0
for node in nodes:
@@ -154,29 +162,6 @@ def test_update_llm_index(
build_document_node.assert_called_once_with(real_document, chunk_size=512)
@pytest.mark.django_db
def test_update_llm_index_cleans_stale_meta_on_rebuild(
temp_llm_index_dir: Path,
real_document: Document,
mock_embed_model: FakeEmbedding,
) -> None:
# A meta.json left over from the FAISS era (or written by older code) must be
# deleted on rebuild so stale artifacts don't accumulate on disk.
stale_meta = temp_llm_index_dir / "meta.json"
stale_meta.write_text(json.dumps({"embedding_model": "old", "dim": 1}))
with patch("documents.models.Document.objects.all") as mock_all:
mock_queryset = MagicMock()
mock_queryset.exists.return_value = True
mock_queryset.__iter__.return_value = iter([real_document])
mock_all.return_value = mock_queryset
indexing.update_llm_index(rebuild=True)
assert not stale_meta.exists(), (
"update_llm_index(rebuild=True) must remove stale meta.json"
)
@pytest.mark.django_db
def test_update_llm_index_rebuilds_on_model_name_change(
temp_llm_index_dir: Path,
@@ -207,10 +192,10 @@ def test_update_llm_index_rebuilds_on_model_name_change(
):
indexing.update_llm_index(rebuild=False)
store = indexing.get_vector_store()
# Schema metadata only updates when the table is dropped and recreated, never on
# incremental writes -- so "model-b" here proves a full rebuild happened.
assert store.stored_model_name() == "model-b"
with indexing.get_vector_store() as store:
# Schema metadata only updates when the table is dropped and recreated, never
# on incremental writes -- so "model-b" here proves a full rebuild happened.
assert store.stored_model_name() == "model-b"
@pytest.mark.django_db
@@ -254,10 +239,10 @@ def test_update_llm_index_partial_update(
indexing.update_llm_index(rebuild=False)
store = indexing.get_vector_store()
assert store.table_exists(), (
"Expected the LanceDB table to exist after incremental update"
)
with indexing.get_vector_store() as store:
assert store.table_exists(), (
"Expected the vector store table to exist after incremental update"
)
@pytest.mark.django_db
@@ -269,10 +254,10 @@ def test_add_or_update_document_updates_existing_entry(
indexing.update_llm_index(rebuild=True)
indexing.llm_index_add_or_update_document(real_document)
store = indexing.get_vector_store()
assert store.table_exists(), (
"Expected the LanceDB table to exist after add-or-update"
)
with indexing.get_vector_store() as store:
assert store.table_exists(), (
"Expected the vector store table to exist after add-or-update"
)
@pytest.mark.django_db
@@ -461,7 +446,7 @@ def test_query_similar_documents_empty_allow_list_fails_closed(
class TestUpdateLlmIndexEmptyDocumentSet:
"""update_llm_index must clear the LanceDB table when all documents are deleted.
"""update_llm_index must clear the vector store table when all documents are deleted.
Without this, the stale vectors are never cleared and subsequent similarity
searches return phantom hits for document IDs that no longer exist in the DB.
@@ -489,10 +474,11 @@ class TestUpdateLlmIndexEmptyDocumentSet:
)
indexing.update_llm_index(rebuild=True)
store = indexing.get_vector_store()
assert store.table_exists(), (
"Precondition failed: expected the LanceDB table to exist before deletion"
)
with indexing.get_vector_store() as store:
assert store.table_exists(), (
"Precondition failed: expected the vector store table to exist "
"before deletion"
)
# Step 2: delete all documents
Document.objects.all().delete()
@@ -503,10 +489,11 @@ class TestUpdateLlmIndexEmptyDocumentSet:
indexing.update_llm_index(rebuild=True)
# Step 4: the table must be absent (no rows) — phantom vectors gone
store2 = indexing.get_vector_store()
assert not store2.table_exists(), (
"Expected the LanceDB table to be absent after rebuilding with no documents"
)
with indexing.get_vector_store() as store2:
assert not store2.table_exists(), (
"Expected the vector store table to be absent after rebuilding "
"with no documents"
)
class TestDocumentUpdatedSignalTriggersLlmReindex:
@@ -578,11 +565,11 @@ class TestLlmIndexAddOrUpdateDocumentEmptyContent:
@pytest.mark.django_db
def test_llm_index_compact_uses_zero_retention(
def test_llm_index_compact_uses_force(
temp_llm_index_dir: Path,
mocker: pytest_mock.MockerFixture,
) -> None:
"""compact must use retention_seconds=0 to clear all MVCC history immediately."""
"""compact must use force=True to rebuild the table and reclaim space immediately."""
mock_store = mocker.MagicMock()
mocker.patch(
"paperless_ai.indexing.write_store",
@@ -594,7 +581,7 @@ def test_llm_index_compact_uses_zero_retention(
indexing.llm_index_compact()
mock_store.compact.assert_called_once_with(retention_seconds=0)
mock_store.compact.assert_called_once_with(force=True)
@pytest.mark.django_db
@@ -678,16 +665,14 @@ class TestLlmIndexLocking:
@pytest.mark.django_db
@pytest.mark.django_db
class TestLanceDbIndexing:
class TestVectorStoreIndexing:
def test_get_vector_store_roundtrip(
self,
temp_llm_index_dir: Path,
mock_embed_model: FakeEmbedding,
) -> None:
from paperless_ai.vector_store import PaperlessLanceVectorStore
store = indexing.get_vector_store()
assert isinstance(store, PaperlessLanceVectorStore)
with indexing.get_vector_store() as store:
assert isinstance(store, PaperlessSqliteVecVectorStore)
def test_add_then_remove_document(
self,
@@ -696,12 +681,13 @@ class TestLanceDbIndexing:
real_document: Document,
) -> None:
indexing.llm_index_add_or_update_document(real_document)
store = indexing.get_vector_store()
table = store.client.open_table(indexing.LLM_INDEX_TABLE)
assert table.count_rows() >= 1
with indexing.get_vector_store() as store:
assert store.table_exists()
count_sql = "SELECT count(*) FROM documents"
assert store.client.execute(count_sql).fetchone()[0] >= 1
indexing.llm_index_remove_document(real_document)
assert store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows() == 0
indexing.llm_index_remove_document(real_document)
assert store.client.execute(count_sql).fetchone()[0] == 0
def test_update_shrinks_chunks_without_orphans(
self,
@@ -712,16 +698,17 @@ class TestLanceDbIndexing:
real_document.content = "word " * 4000 # many chunks
real_document.save()
indexing.llm_index_add_or_update_document(real_document)
store = indexing.get_vector_store()
big = store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows()
count_sql = "SELECT count(*) FROM documents"
with indexing.get_vector_store() as store:
big = store.client.execute(count_sql).fetchone()[0]
real_document.content = "short" # one chunk
real_document.save()
indexing.llm_index_add_or_update_document(real_document)
real_document.content = "short" # one chunk
real_document.save()
indexing.llm_index_add_or_update_document(real_document)
rows = store.client.open_table(indexing.LLM_INDEX_TABLE).count_rows()
assert rows < big
assert rows >= 1
rows = store.client.execute(count_sql).fetchone()[0]
assert rows < big
assert rows >= 1
@pytest.mark.django_db
@@ -740,3 +727,58 @@ class TestQuerySimilarDocuments:
results = indexing.query_similar_documents(a, document_ids=[b.id])
assert all(doc.id == b.id for doc in results)
class TestRetrieveSimilarNodes:
@pytest.mark.django_db
def test_returns_raw_nodes_from_retriever(
self,
temp_llm_index_dir: Path,
real_document: Document,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch("paperless_ai.indexing.llm_index_exists", return_value=True)
mocker.patch("paperless_ai.indexing.load_or_build_index")
node1 = SimpleNamespace(metadata={"document_id": "1"})
node2 = SimpleNamespace(metadata={"document_id": "2"})
retriever = mocker.MagicMock()
retriever.retrieve.return_value = [node1, node2]
mocker.patch(
"llama_index.core.retrievers.VectorIndexRetriever",
return_value=retriever,
)
result = indexing.retrieve_similar_nodes(real_document, top_k=3)
assert result == [node1, node2]
@pytest.mark.django_db
def test_empty_allow_list_fails_closed(
self,
real_document: Document,
mocker: pytest_mock.MockerFixture,
) -> None:
load = mocker.patch("paperless_ai.indexing.load_or_build_index")
result = indexing.retrieve_similar_nodes(real_document, document_ids=[])
assert result == []
load.assert_not_called()
@pytest.mark.django_db
def test_queues_update_when_index_missing(
self,
temp_llm_index_dir: Path,
real_document: Document,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch("paperless_ai.indexing.llm_index_exists", return_value=False)
queue = mocker.patch("paperless_ai.indexing.queue_llm_index_update_if_needed")
result = indexing.retrieve_similar_nodes(real_document, top_k=2)
assert result == []
queue.assert_called_once_with(
rebuild=False,
reason="LLM index not found for similarity query.",
)
+4 -8
@@ -3,9 +3,13 @@ from unittest.mock import MagicMock
from unittest.mock import patch
import pytest
from llama_index.core import settings as llama_settings
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
from llama_index.core.schema import TextNode
from documents.tests.factories import DocumentFactory
from paperless_ai import chat
from paperless_ai import indexing
from paperless_ai.chat import CHAT_ERROR_MESSAGE
from paperless_ai.chat import CHAT_METADATA_DELIMITER
from paperless_ai.chat import stream_chat_with_documents
@@ -13,9 +17,6 @@ from paperless_ai.chat import stream_chat_with_documents
@pytest.fixture(autouse=True)
def patch_embed_model():
from llama_index.core import settings as llama_settings
from llama_index.core.embeddings.mock_embed_model import MockEmbedding
# Use a real BaseEmbedding subclass to satisfy llama-index 0.14 validation
llama_settings.Settings.embed_model = MockEmbedding(embed_dim=1536)
yield
@@ -241,8 +242,6 @@ class TestStreamChatRetrieval:
temp_llm_index_dir,
mock_embed_model,
) -> None:
from documents.tests.factories import DocumentFactory
doc = DocumentFactory.create(content="hello world")
# Nothing indexed for this document yet.
out = list(chat.stream_chat_with_documents("question?", [doc]))
@@ -258,9 +257,6 @@ class TestStreamChatRetrieval:
requested documents only — content from other indexed documents must
not be surfaced.
"""
from documents.tests.factories import DocumentFactory
from paperless_ai import indexing
included = DocumentFactory.create(content="included document content")
excluded = DocumentFactory.create(content="excluded document content")
indexing.llm_index_add_or_update_document(included)
+4 -2
@@ -224,15 +224,17 @@ def test_build_llm_index_text(mock_document):
result = build_llm_index_text(mock_document)
# Structured fields live in node.metadata for LLM context not body text
# Structured fields live in node.metadata for LLM context -- not body text
assert "Title: Test Title" not in result
assert "Created: 2023-01-01" not in result
assert "Tags: Tag1, Tag2" not in result
assert "Document Type: Invoice" not in result
assert "Correspondent: Test Correspondent" not in result
assert "Filename:" not in result
assert "Storage Path:" not in result
assert "Archive Serial Number:" not in result
# Fields without a metadata equivalent stay in body text
assert "Filename: test_file.pdf" in result
assert "Notes: Note1,Note2" in result
assert "Content:\n\nThis is the document content." in result
assert "Custom Field - Field1: Value1\nCustom Field - Field2: Value2" in result
@@ -0,0 +1,134 @@
import logging
import sqlite3
import threading
from pathlib import Path
from unittest.mock import MagicMock
import pytest
from django.conf import settings
from filelock import ReadWriteLock
from llama_index.core.schema import TextNode
from pytest_django.fixtures import SettingsWrapper
from paperless_ai import indexing
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
DIM = 8
def _node(node_id: str, document_id: str, *, seed: float = 0.0) -> TextNode:
node = TextNode(
id_=node_id,
text="chunk",
metadata={"document_id": document_id, "modified": "2026-06-01T00:00:00"},
)
node.relationships = {}
node.embedding = [seed + i / 100 for i in range(DIM)]
return node
def _seed_bloated_index(index_dir: Path) -> None:
"""Create an index whose cumulative inserts far exceed live rows."""
store = PaperlessSqliteVecVectorStore(uri=str(index_dir))
store.add([_node(f"d{j}", str(j), seed=float(j)) for j in range(20)])
for cycle in range(6):
for j in range(20):
store.upsert_document(
str(j),
[_node(f"d{j}-c{cycle}", str(j), seed=float(j))],
)
store.client.close()
def _bloat_ratio(index_dir: Path) -> float:
store = PaperlessSqliteVecVectorStore(uri=str(index_dir))
live = store.client.execute("SELECT count(*) FROM documents").fetchone()[0]
row = store.client.execute(
"SELECT value FROM index_meta WHERE key = 'total_inserts'",
).fetchone()
total = int(row["value"]) if row else live
store.client.close()
return total / max(live, 1)
def _integrity_ok(index_dir: Path) -> bool:
store = PaperlessSqliteVecVectorStore(uri=str(index_dir))
result = store.client.execute("PRAGMA integrity_check").fetchone()[0]
rows = store.client.execute("SELECT count(*) FROM documents").fetchone()[0]
store.client.close()
return result == "ok" and rows == 20
def _reader_lock() -> ReadWriteLock:
# A distinct instance simulates a reader in another process: it coordinates
# with the production lock purely through SQLite, never via a reentrant upgrade.
return ReadWriteLock(str(settings.LLM_INDEX_RWLOCK), is_singleton=False)
class TestCompactionLock:
def test_compaction_skips_when_a_reader_holds_the_lock(
self,
temp_llm_index_dir: Path,
settings: SettingsWrapper,
caplog: pytest.LogCaptureFixture,
) -> None:
_seed_bloated_index(temp_llm_index_dir)
settings.LLM_INDEX_COMPACTION_LOCK_TIMEOUT = 0.3
lock = _reader_lock()
with lock.read_lock(), caplog.at_level(logging.INFO):
indexing.llm_index_compact() # must not raise
lock.close()
# Swap was skipped: bloat remains, nothing corrupted, data intact.
assert _integrity_ok(temp_llm_index_dir)
assert _bloat_ratio(temp_llm_index_dir) > 2
assert "Skipping LLM index compaction" in caplog.text
def test_compaction_runs_when_no_reader_holds_the_lock(
self,
temp_llm_index_dir: Path,
) -> None:
_seed_bloated_index(temp_llm_index_dir)
assert _bloat_ratio(temp_llm_index_dir) > 2
indexing.llm_index_compact()
assert _bloat_ratio(temp_llm_index_dir) == pytest.approx(1.0)
assert _integrity_ok(temp_llm_index_dir)
def test_normal_write_is_not_gated_by_the_compaction_lock(
self,
temp_llm_index_dir: Path,
) -> None:
"""A held exclusive lock must not block ordinary writes (WAL handles them)."""
_seed_bloated_index(temp_llm_index_dir)
done = threading.Event()
def remove() -> None:
indexing.llm_index_remove_document(MagicMock(id=999))
done.set()
holder = _reader_lock()
with holder.write_lock():
t = threading.Thread(target=remove)
t.start()
finished = done.wait(timeout=5)
t.join(timeout=2)
holder.close()
assert finished, "a normal write blocked on the compaction lock"
class TestReadStore:
def test_closes_connection_on_exit(self, temp_llm_index_dir: Path) -> None:
with indexing.read_store() as store:
conn = store.client
assert conn.execute("SELECT 1").fetchone()[0] == 1
with pytest.raises(sqlite3.ProgrammingError):
conn.execute("SELECT 1")
def test_concurrent_readers_do_not_block(self, temp_llm_index_dir: Path) -> None:
_seed_bloated_index(temp_llm_index_dir)
with indexing.read_store() as a, indexing.read_store() as b:
assert a.table_exists()
assert b.table_exists()
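Review note: `test_closes_connection_on_exit` relies on standard `sqlite3` behaviour, where using a connection after `close()` raises `sqlite3.ProgrammingError`. A minimal sketch of that contract follows. `read_connection` is a hypothetical helper; the real `indexing.read_store` is not shown in this diff and may also take locks or wrap the connection in a store object.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def read_connection(path: str):
    # Hypothetical helper: open a connection and always close it on exit.
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        conn.close()


with read_connection(":memory:") as conn:
    value = conn.execute("SELECT 1").fetchone()[0]

try:
    conn.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    # Operating on a closed connection raises ProgrammingError.
    closed = True
```

Closing in a `finally` block means the handle is released even if the body raises, which matters if a later compaction step needs to swap the database file.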
+1 -1
@@ -12,7 +12,7 @@ class TestLazyAiImports:
"os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'paperless.settings')\n"
"django.setup()\n"
"import documents.tasks # noqa: F401\n"
"leaked = [m for m in ('lancedb', 'pyarrow', 'llama_index') "
"leaked = [m for m in ('lancedb', 'pyarrow', 'llama_index', 'sqlite_vec') "
"if m in sys.modules]\n"
"assert not leaked, f'AI libraries leaked into the light path: {leaked}'\n"
)
+92
@@ -1,12 +1,15 @@
import difflib
from unittest.mock import patch
import pytest
import pytest_mock
from django.test import TestCase
from documents.models import Correspondent
from documents.models import DocumentType
from documents.models import StoragePath
from documents.models import Tag
from documents.tests.factories import TagFactory
from paperless_ai.matching import extract_unmatched_names
from paperless_ai.matching import match_correspondents_by_name
from paperless_ai.matching import match_document_types_by_name
@@ -87,6 +90,95 @@ class TestAIMatching(TestCase):
self.assertEqual(result[1].name, "Test Tag 2")
class TestHintedMatching:
def test_hinted_verbatim_skips_fuzzy(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.matching.get_objects_for_user_owner_aware",
return_value=[TagFactory.build(name="Bloodwork")],
)
spy = mocker.spy(difflib, "get_close_matches")
result = match_tags_by_name(
["Bloodwork"],
user=None,
hinted_names={"Bloodwork"},
)
assert [t.name for t in result] == ["Bloodwork"]
spy.assert_not_called()
def test_unhinted_name_still_fuzzy_matches(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.matching.get_objects_for_user_owner_aware",
return_value=[TagFactory.build(name="Bloodwork")],
)
# "Bloodwrok" is a typo not in hints -> fuzzy still maps it to Bloodwork.
result = match_tags_by_name(
["Bloodwrok"],
user=None,
hinted_names={"Taxes"},
)
assert [t.name for t in result] == ["Bloodwork"]
def test_hinted_name_with_whitespace_exact_matches(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.matching.get_objects_for_user_owner_aware",
return_value=[TagFactory.build(name="Bloodwork")],
)
spy = mocker.spy(difflib, "get_close_matches")
result = match_tags_by_name(
["Bloodwork "],
user=None,
hinted_names={"Bloodwork"},
)
assert [t.name for t in result] == ["Bloodwork"]
spy.assert_not_called()
def test_hinted_name_absent_from_queryset_is_skipped_not_fuzzed(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
# A hint with no exact object must not fall through to fuzzy.
mocker.patch(
"paperless_ai.matching.get_objects_for_user_owner_aware",
return_value=[TagFactory.build(name="Bloodwork")],
)
result = match_tags_by_name(
["Bloodwrok"],
user=None,
hinted_names={"Bloodwrok"},
)
assert result == []
def test_backward_compatible_without_kwarg(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.matching.get_objects_for_user_owner_aware",
return_value=[TagFactory.build(name="Test Tag 1")],
)
result = match_tags_by_name(["Test Tag 1", "Nonexistent"], user=None)
assert [t.name for t in result] == ["Test Tag 1"]
@pytest.mark.django_db
class TestExtractUnmatchedNamesNormalization:
def test_punctuated_name_already_matched_is_not_returned_as_unmatched(
+220
@@ -0,0 +1,220 @@
from types import SimpleNamespace
import pytest_mock
from documents.tests.factories import DocumentFactory
from paperless_ai.taxonomy import TaxonomyHints
from paperless_ai.taxonomy import build_taxonomy_hints_from_nodes
from paperless_ai.taxonomy import format_hints_for_prompt
from paperless_ai.taxonomy import get_taxonomy_hints_for_document
def make_node(**metadata: object) -> SimpleNamespace:
"""A stand-in for NodeWithScore: only ``.metadata`` is accessed."""
return SimpleNamespace(metadata=metadata)
class TestBuildTaxonomyHintsFromNodes:
def test_returns_all_four_keys(self) -> None:
hints = build_taxonomy_hints_from_nodes([])
assert set(hints.keys()) == {
"tags",
"document_types",
"correspondents",
"storage_paths",
}
def test_collects_and_sorts_values(self) -> None:
nodes = [
make_node(
tags=["Taxes", "Bloodwork"],
document_type="Invoice",
correspondent="IRS",
storage_path="Financial",
),
]
hints = build_taxonomy_hints_from_nodes(nodes)
assert hints["tags"] == ["Bloodwork", "Taxes"]
assert hints["document_types"] == ["Invoice"]
assert hints["correspondents"] == ["IRS"]
assert hints["storage_paths"] == ["Financial"]
def test_deduplicates_across_nodes(self) -> None:
nodes = [
make_node(tags=["Taxes"], document_type="Invoice"),
make_node(tags=["Taxes", "Medical"], document_type="Invoice"),
]
hints = build_taxonomy_hints_from_nodes(nodes)
assert hints["tags"] == ["Medical", "Taxes"]
assert hints["document_types"] == ["Invoice"]
def test_none_values_skipped(self) -> None:
nodes = [
make_node(
tags=["Taxes", None, ""],
document_type=None,
correspondent=None,
storage_path=None,
),
]
hints = build_taxonomy_hints_from_nodes(nodes)
assert hints["tags"] == ["Taxes"]
assert hints["document_types"] == []
assert hints["correspondents"] == []
assert hints["storage_paths"] == []
def test_missing_storage_path_key_handled(self) -> None:
# Pre-enrichment nodes have no storage_path key at all.
nodes = [make_node(tags=["Taxes"], document_type="Invoice")]
hints = build_taxonomy_hints_from_nodes(nodes)
assert hints["storage_paths"] == []
def test_empty_node_list_all_empty(self) -> None:
hints = build_taxonomy_hints_from_nodes([])
assert hints == {
"tags": [],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
def test_output_stable_across_calls(self) -> None:
nodes = [make_node(tags=["b", "a", "c"])]
assert build_taxonomy_hints_from_nodes(
nodes,
) == build_taxonomy_hints_from_nodes(nodes)
class TestFormatHintsForPrompt:
def test_all_blocks_present_when_all_categories_nonempty(self) -> None:
hints: TaxonomyHints = {
"tags": ["Bloodwork"],
"document_types": ["Invoice"],
"correspondents": ["IRS"],
"storage_paths": ["Financial"],
}
result = format_hints_for_prompt(hints)
assert "Available tags:" in result
assert "Available document types:" in result
assert "Available correspondents:" in result
assert "Available storage paths:" in result
assert "- Bloodwork" in result
def test_empty_category_produces_no_block(self) -> None:
hints: TaxonomyHints = {
"tags": ["Bloodwork"],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
result = format_hints_for_prompt(hints)
assert "Available tags:" in result
assert "Available document types:" not in result
assert "Available correspondents:" not in result
assert "Available storage paths:" not in result
def test_all_empty_produces_empty_string(self) -> None:
hints: TaxonomyHints = {
"tags": [],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
assert format_hints_for_prompt(hints) == ""
def test_instruction_line_appears_once(self) -> None:
hints: TaxonomyHints = {
"tags": ["Bloodwork"],
"document_types": ["Invoice"],
"correspondents": [],
"storage_paths": [],
}
result = format_hints_for_prompt(hints)
assert result.count("Prefer existing names from these lists verbatim") == 1
class TestGetTaxonomyHintsForDocument:
def test_returns_none_when_embedding_backend_off(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.taxonomy.AIConfig",
return_value=SimpleNamespace(llm_embedding_backend=None),
)
retrieve = mocker.patch("paperless_ai.taxonomy.retrieve_similar_nodes")
result = get_taxonomy_hints_for_document(DocumentFactory.build(), user=None)
assert result is None
retrieve.assert_not_called()
def test_passes_owner_aware_ids_when_user_present(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.taxonomy.AIConfig",
return_value=SimpleNamespace(llm_embedding_backend="huggingface"),
)
mocker.patch(
"paperless_ai.taxonomy.visible_document_ids_for_user",
return_value=[1, 2, 3],
)
retrieve = mocker.patch(
"paperless_ai.taxonomy.retrieve_similar_nodes",
return_value=[],
)
document = DocumentFactory.build()
user = mocker.MagicMock()
get_taxonomy_hints_for_document(document, user=user)
retrieve.assert_called_once_with(
document=document,
document_ids=[1, 2, 3],
)
def test_returns_populated_hints_when_nodes_found(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.taxonomy.AIConfig",
return_value=SimpleNamespace(llm_embedding_backend="huggingface"),
)
mocker.patch(
"paperless_ai.taxonomy.retrieve_similar_nodes",
return_value=[make_node(tags=["Taxes"], document_type="Invoice")],
)
result = get_taxonomy_hints_for_document(DocumentFactory.build(), user=None)
assert result == {
"tags": ["Taxes"],
"document_types": ["Invoice"],
"correspondents": [],
"storage_paths": [],
}
def test_returns_empty_hints_not_none_when_no_nodes(
self,
mocker: pytest_mock.MockerFixture,
) -> None:
mocker.patch(
"paperless_ai.taxonomy.AIConfig",
return_value=SimpleNamespace(llm_embedding_backend="huggingface"),
)
mocker.patch(
"paperless_ai.taxonomy.retrieve_similar_nodes",
return_value=[],
)
result = get_taxonomy_hints_for_document(DocumentFactory.build(), user=None)
assert result == {
"tags": [],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
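Review note: the tests above pin down the observable contract of the hint builder and the prompt formatter (None and empty values skipped, missing keys tolerated, empty categories producing no block, the instruction line appearing once). The following is a minimal sketch of logic that would satisfy that contract. It is not the code under review: it works on plain metadata dicts rather than RAG nodes, the names `sketch_build_hints` and `sketch_format_hints` are hypothetical, and the alphabetical ordering is an assumption that merely agrees with the `["Medical", "Taxes"]` expectation.

```python
from typing import Any

# Hypothetical mapping from a single-valued metadata key to its hint category.
_SINGLE_VALUE_KEYS = {
    "document_type": "document_types",
    "correspondent": "correspondents",
    "storage_path": "storage_paths",
}

# Human-readable labels used in the prompt block headers.
_LABELS = {
    "tags": "tags",
    "document_types": "document types",
    "correspondents": "correspondents",
    "storage_paths": "storage paths",
}

# Assumed wording; the tests only check for the leading phrase.
INSTRUCTION = "Prefer existing names from these lists verbatim."


def sketch_build_hints(metadatas: list[dict[str, Any]]) -> dict[str, list[str]]:
    """Collect unique, non-empty taxonomy names from plain metadata dicts."""
    collected: dict[str, set[str]] = {category: set() for category in _LABELS}
    for meta in metadatas:
        # "tags" is a list that may contain None or "" entries.
        for tag in meta.get("tags") or []:
            if tag:
                collected["tags"].add(tag)
        # The other keys hold one value each and may be missing entirely.
        for key, category in _SINGLE_VALUE_KEYS.items():
            value = meta.get(key)
            if value:
                collected[category].add(value)
    # Sorting makes the output stable across calls (an assumption, see above).
    return {category: sorted(names) for category, names in collected.items()}


def sketch_format_hints(hints: dict[str, list[str]]) -> str:
    """Render one block per non-empty category, then a single instruction line."""
    blocks = []
    for category, label in _LABELS.items():
        names = hints[category]
        if names:
            bullet_lines = "\n".join(f"- {name}" for name in names)
            blocks.append(f"Available {label}:\n{bullet_lines}")
    if not blocks:
        return ""
    return "\n\n".join(blocks) + "\n\n" + INSTRUCTION
```

Returning an empty string when every category is empty lets a caller splice the result into a prompt unconditionally, which matches `test_all_empty_produces_empty_string`.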
+559 -370
@@ -1,417 +1,606 @@
import sqlite3
from collections.abc import Generator
from pathlib import Path
import pytest
from llama_index.core.schema import NodeRelationship
from llama_index.core.schema import RelatedNodeInfo
from llama_index.core.schema import TextNode
from llama_index.core.vector_stores.types import FilterOperator
from llama_index.core.vector_stores.types import MetadataFilter
from llama_index.core.vector_stores.types import MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQuery
from paperless_ai.vector_store import PaperlessLanceVectorStore
from paperless_ai.vector_store import DB_FILENAME
from paperless_ai.vector_store import DEFAULT_TABLE_NAME
from paperless_ai.vector_store import MIGRATIONS
from paperless_ai.vector_store import SCHEMA_VERSION
from paperless_ai.vector_store import Migration
from paperless_ai.vector_store import PaperlessSqliteVecVectorStore
from paperless_ai.vector_store import _build_where
DIM = 8
DIM = 16
def _node(node_id: str, document_id: str, text: str, vec: float) -> TextNode:
node = TextNode(id_=node_id, text=text, metadata={"document_id": document_id})
node.set_content(text)
node.embedding = [vec] * DIM
# Use relationships so ref_doc_id resolves correctly (it's a read-only property)
node.relationships = {
NodeRelationship.SOURCE: RelatedNodeInfo(node_id=document_id),
}
def make_node(
node_id: str,
document_id: str,
*,
modified: str = "2026-06-10T00:00:00",
seed: float = 0.0,
text: str = "some text",
) -> TextNode:
node = TextNode(
id_=node_id,
text=text,
metadata={"document_id": document_id, "modified": modified},
)
node.relationships = {}
node.embedding = [seed + i / 100 for i in range(DIM)]
return node
class TestPaperlessLanceVectorStoreCrud:
@pytest.fixture
def store(self, tmp_path: Path) -> PaperlessLanceVectorStore:
return PaperlessLanceVectorStore(uri=str(tmp_path / "idx"))
@pytest.fixture
def store(tmp_path: Path) -> Generator[PaperlessSqliteVecVectorStore, None, None]:
with PaperlessSqliteVecVectorStore(uri=str(tmp_path)) as store:
yield store
def test_add_then_query_returns_node(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add([_node("1-0", "1", "alpha", 0.1), _node("2-0", "2", "beta", 0.9)])
result = store.query(
VectorStoreQuery(query_embedding=[0.1] * DIM, similarity_top_k=1),
)
def _query(
store: PaperlessSqliteVecVectorStore,
embedding: list[float],
top_k: int = 5,
filters=None,
):
return store.query(
VectorStoreQuery(
query_embedding=embedding,
similarity_top_k=top_k,
filters=filters,
),
)
assert len(result.nodes) == 1
def _eq_filter(key: str, value: str):
return MetadataFilters(
filters=[MetadataFilter(key=key, operator=FilterOperator.EQ, value=value)],
)
def _in_filter(document_ids: list[str]):
return MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.IN,
value=document_ids,
),
],
)
class TestCrud:
def test_add_then_query_returns_node(self, store) -> None:
node = make_node("n1", "1")
assert store.add([node]) == ["n1"]
result = _query(store, node.embedding, top_k=1)
assert result.ids == ["n1"]
assert result.nodes[0].metadata["document_id"] == "1"
# cosine distance of the identical vector is 0 -> similarity 1
assert result.similarities[0] == pytest.approx(1.0)
def test_query_empty_table_returns_empty_no_raise(
self,
store: PaperlessLanceVectorStore,
) -> None:
result = store.query(
VectorStoreQuery(query_embedding=[0.1] * DIM, similarity_top_k=5),
)
assert result.nodes == []
assert result.ids == []
def test_query_empty_store_returns_empty_no_raise(self, store) -> None:
result = _query(store, [0.0] * DIM)
assert result.ids == [] and result.nodes == [] and result.similarities == []
def test_delete_removes_all_chunks_of_document(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add([_node("1-0", "1", "a", 0.1), _node("1-1", "1", "b", 0.2)])
store.add([_node("2-0", "2", "c", 0.9)])
def test_add_empty_list_is_noop(self, store) -> None:
assert store.add([]) == []
assert not store.table_exists()
def test_delete_removes_all_chunks_of_document(self, store) -> None:
store.add([make_node("a1", "1"), make_node("a2", "1"), make_node("b1", "2")])
store.delete("1")
result = _query(store, [0.0] * DIM, top_k=10)
assert result.ids == ["b1"]
assert store.client.open_table("documents").count_rows() == 1
def test_query_with_in_filter_scopes_results(self, store) -> None:
store.add(
[
make_node("a1", "1", seed=0.0),
make_node("b1", "2", seed=1.0),
make_node("c1", "3", seed=2.0),
],
)
result = _query(store, [0.0] * DIM, top_k=10, filters=_in_filter(["2", "3"]))
assert sorted(result.ids) == ["b1", "c1"]
def test_query_with_in_filter_scopes_results(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add([_node("1-0", "1", "a", 0.1), _node("2-0", "2", "b", 0.1)])
def test_query_respects_top_k_with_filter(self, store) -> None:
# k semantics: global top-k even with IN filters (document_id is a
# metadata column, not a partition key -- see design doc).
store.add(
[make_node(f"n{i}", str(i % 4), seed=float(i)) for i in range(12)],
)
result = _query(
store,
[0.0] * DIM,
top_k=3,
filters=_in_filter(["0", "1", "2", "3"]),
)
assert len(result.ids) == 3
assert result.similarities == sorted(result.similarities, reverse=True)
result = store.query(
VectorStoreQuery(
query_embedding=[0.1] * DIM,
similarity_top_k=5,
filters=MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.IN,
value=["2"],
),
],
def test_get_nodes_filter_and_empty_paths(self, store) -> None:
assert store.get_nodes(filters=_in_filter(["1"])) == [] # no table yet
store.add([make_node("a1", "1"), make_node("b1", "2")])
nodes = store.get_nodes(filters=_in_filter(["1"]))
assert [n.node_id for n in nodes] == ["a1"]
assert nodes[0].embedding is not None
assert store.get_nodes(filters=_in_filter(["999"])) == []
def test_query_with_eq_filter_scopes_results(self, store) -> None:
store.add(
[
make_node("a1", "1", seed=0.0),
make_node("b1", "2", seed=1.0),
make_node("c1", "3", seed=2.0),
],
)
result = _query(
store,
[0.0] * DIM,
top_k=10,
filters=_eq_filter("document_id", "2"),
)
assert result.ids == ["b1"]
def test_get_nodes_node_ids_not_implemented(self, store) -> None:
with pytest.raises(NotImplementedError):
store.get_nodes(node_ids=["x"])
def test_fresh_instance_sees_existing_table(self, store, tmp_path: Path) -> None:
store.add([make_node("a1", "1")])
with PaperlessSqliteVecVectorStore(uri=str(tmp_path)) as reopened:
assert reopened.table_exists()
assert reopened.vector_dim() == DIM
assert _query(reopened, [0.0] * DIM, top_k=1).ids == ["a1"]
def test_table_exists_and_drop(self, store) -> None:
assert not store.table_exists()
store.add([make_node("a1", "1")])
assert store.table_exists()
store.drop_table()
assert not store.table_exists()
assert store.vector_dim() is None
class TestBuildWhere:
def test_fails_closed_when_no_filter_is_translatable(self) -> None:
# A nested MetadataFilters is not a MetadataFilter, so it is skipped.
# With no translatable clauses, the function must fail closed rather
# than emit "()" (invalid SQL) and never widen document access.
nested = MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.EQ,
value="1",
),
),
],
)
where, params = _build_where(MetadataFilters(filters=[nested]))
assert where == "1 = 0"
assert params == []
assert [n.metadata["document_id"] for n in result.nodes] == ["2"]
def test_query_with_untranslatable_filter_returns_no_rows(self, store) -> None:
store.add([make_node("a1", "1"), make_node("b1", "2")])
nested = MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.EQ,
value="1",
),
],
)
filters = MetadataFilters(filters=[nested])
# Must not raise (no "WHERE ()") and must return nothing (fail closed).
assert _query(store, [0.0] * DIM, top_k=5, filters=filters).ids == []
assert store.get_nodes(filters=filters) == []
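Review note: the two tests above require the filter translation to fail closed, returning `"1 = 0"` with no parameters when nothing is translatable instead of emitting an empty `WHERE ()` or dropping the filter. A minimal sketch of that idea follows. It deliberately uses small hypothetical filter classes (`Eq`, `In`) and a plain SQLite table rather than the llama_index filter types and the vec0 table, so it illustrates the pattern only and is not the `_build_where` under review.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Eq:
    """Hypothetical stand-in for an equality metadata filter."""

    key: str
    value: str


@dataclass
class In:
    """Hypothetical stand-in for an IN metadata filter."""

    key: str
    values: list[str]


# Column names are interpolated into SQL, so only known columns are accepted.
ALLOWED_COLUMNS = {"document_id", "modified"}


def sketch_build_where(filters: list[object]) -> tuple[str, list[str]]:
    """Translate filters to a parameterized WHERE body, failing closed."""
    clauses: list[str] = []
    params: list[str] = []
    for f in filters:
        if isinstance(f, Eq) and f.key in ALLOWED_COLUMNS:
            clauses.append(f"{f.key} = ?")
            params.append(f.value)
        elif isinstance(f, In) and f.key in ALLOWED_COLUMNS:
            if not f.values:
                # An empty IN list can match nothing.
                clauses.append("1 = 0")
                continue
            placeholders = ", ".join("?" for _ in f.values)
            clauses.append(f"{f.key} IN ({placeholders})")
            params.extend(f.values)
        # Anything else is untranslatable and skipped.
    if not clauses:
        # Nothing translatable: match no rows rather than all rows.
        return "1 = 0", []
    return " AND ".join(clauses), params


def demo_ids(filters: list[object]) -> list[str]:
    """Run the generated clause against a tiny in-memory table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE documents (id TEXT, document_id TEXT)")
        conn.executemany(
            "INSERT INTO documents VALUES (?, ?)",
            [("a1", "1"), ("b1", "2")],
        )
        where, params = sketch_build_where(filters)
        rows = conn.execute(
            f"SELECT id FROM documents WHERE {where} ORDER BY id",
            params,
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Values always travel as bound parameters, and an unrecognized filter narrows the result to nothing instead of widening document access.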
def test_get_nodes_filter_returns_empty_cleanly(
class TestUpsert:
def test_upsert_replaces_and_prunes_stale_chunks(self, store) -> None:
store.add(
[make_node("d1c1", "1"), make_node("d1c2", "1"), make_node("d2c1", "2")],
)
store.upsert_document("1", [make_node("d1new", "1")])
result = _query(store, [0.0] * DIM, top_k=10)
assert sorted(result.ids) == ["d1new", "d2c1"]
def test_upsert_creates_table_when_missing(self, store) -> None:
store.upsert_document("1", [make_node("a1", "1")])
assert _query(store, [0.0] * DIM, top_k=1).ids == ["a1"]
def test_upsert_empty_nodes_removes_document(self, store) -> None:
store.add([make_node("a1", "1"), make_node("b1", "2")])
store.upsert_document("1", [])
assert _query(store, [0.0] * DIM, top_k=10).ids == ["b1"]
def test_upsert_is_atomic_for_concurrent_readers(
self,
store: PaperlessLanceVectorStore,
store,
tmp_path: Path,
) -> None:
store.add([_node("1-0", "1", "a", 0.1)])
nodes = store.get_nodes(
filters=MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.IN,
value=["999"],
),
],
),
)
assert nodes == []
"""A second connection must never observe document 1 half-replaced."""
store.add([make_node("a1", "1"), make_node("a2", "1")])
with PaperlessSqliteVecVectorStore(uri=str(tmp_path)) as reader:
store.upsert_document("1", [make_node("a3", "1")])
ids = [n.node_id for n in reader.get_nodes(filters=_in_filter(["1"]))]
assert ids == ["a3"]
def test_get_nodes_returns_empty_when_no_table(
self,
store: PaperlessLanceVectorStore,
) -> None:
result = store.get_nodes(
filters=MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.IN,
value=["1"],
),
],
),
)
assert result == []
def test_fresh_instance_filters_existing_table(
class TestMetadataCoercion:
def test_none_metadata_values_become_empty_strings(self, store) -> None:
node = make_node("a1", "1")
node.metadata["modified"] = None
store.add([node]) # must not raise (vec0 rejects NULL metadata)
assert store.get_modified_times() == {"1": ""}
class TestModelNameTracking:
def test_stored_model_name_none_without_table(self, tmp_path: Path) -> None:
with PaperlessSqliteVecVectorStore(
uri=str(tmp_path),
embed_model_name="model-a",
) as store:
assert store.stored_model_name() is None
def test_model_name_stored_after_add_and_persists(self, tmp_path: Path) -> None:
with PaperlessSqliteVecVectorStore(
uri=str(tmp_path),
embed_model_name="model-a",
) as store:
store.add([make_node("a1", "1")])
assert store.stored_model_name() == "model-a"
with PaperlessSqliteVecVectorStore(uri=str(tmp_path)) as reopened:
assert reopened.stored_model_name() == "model-a"
def test_config_mismatch_semantics(self, tmp_path: Path) -> None:
with PaperlessSqliteVecVectorStore(
uri=str(tmp_path),
embed_model_name="model-a",
) as store:
assert not store.config_mismatch("anything") # no table yet
store.add([make_node("a1", "1")])
assert not store.config_mismatch("model-a")
assert store.config_mismatch("model-b")
def test_config_mismatch_false_when_table_predates_tracking(
self,
tmp_path: Path,
) -> None:
uri = str(tmp_path / "idx")
PaperlessLanceVectorStore(uri=uri).add(
[_node("1-0", "1", "a", 0.1), _node("2-0", "2", "b", 0.1)],
)
reopened = PaperlessLanceVectorStore(uri=uri)
result = reopened.query(
VectorStoreQuery(
query_embedding=[0.1] * DIM,
similarity_top_k=5,
filters=MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.IN,
value=["1"],
),
],
),
),
)
assert [n.metadata["document_id"] for n in result.nodes] == ["1"]
def test_table_exists_and_drop(
self,
store: PaperlessLanceVectorStore,
) -> None:
assert store.table_exists() is False
store.add([_node("1-0", "1", "a", 0.1)])
assert store.table_exists() is True
assert store.vector_dim() == DIM
store.drop_table()
assert store.table_exists() is False
def test_build_where_or_condition(self) -> None:
from llama_index.core.vector_stores.types import FilterCondition
from paperless_ai.vector_store import _build_where
where = _build_where(
MetadataFilters(
filters=[
MetadataFilter(
key="document_id",
operator=FilterOperator.EQ,
value="1",
),
MetadataFilter(
key="document_id",
operator=FilterOperator.EQ,
value="2",
),
],
condition=FilterCondition.OR,
),
)
assert where == "document_id = '1' OR document_id = '2'"
class TestPaperlessLanceVectorStoreUpsert:
@pytest.fixture
def store(self, tmp_path: Path) -> PaperlessLanceVectorStore:
s = PaperlessLanceVectorStore(uri=str(tmp_path / "idx"))
s.add(
[
_node("1-0", "1", "old0", 0.1),
_node("1-1", "1", "old1", 0.2),
_node("1-2", "1", "old2", 0.3),
_node("2-0", "2", "keep", 0.9),
],
)
return s
def test_upsert_prunes_stale_chunks_and_keeps_others(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.upsert_document(
"1",
[_node("1-0", "1", "new0", 0.1), _node("1-1", "1", "new1", 0.2)],
)
table = store.client.open_table("documents")
doc1 = sorted(
r["id"] for r in table.search().where("document_id = '1'").to_list()
)
assert doc1 == ["1-0", "1-1"] # 1-2 pruned
assert table.count_rows() == 3 # 2 new doc1 + 1 doc2
def test_upsert_is_single_commit(
self,
store: PaperlessLanceVectorStore,
) -> None:
table = store.client.open_table("documents")
before = table.version
store.upsert_document("1", [_node("1-0", "1", "new0", 0.1)])
assert store.client.open_table("documents").version == before + 1
def test_upsert_empty_nodes_removes_document(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.upsert_document("1", [])
table = store.client.open_table("documents")
remaining = sorted(r["document_id"] for r in table.search().to_list())
assert "1" not in remaining
assert "2" in remaining
class TestPaperlessLanceVectorStoreMaintenance:
@pytest.fixture
def store(self, tmp_path: Path) -> PaperlessLanceVectorStore:
return PaperlessLanceVectorStore(uri=str(tmp_path / "idx"))
def test_maybe_create_ann_index_noop_below_threshold(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add([_node("1-0", "1", "a", 0.1)])
# Threshold far above row count -> no index attempted, no error.
store.maybe_create_ann_index(min_rows=1000)
# Still queryable.
result = store.query(
VectorStoreQuery(query_embedding=[0.1] * DIM, similarity_top_k=1),
)
assert len(result.nodes) == 1
def test_maybe_create_ann_index_non_divisible_dim_falls_back(
self,
store: PaperlessLanceVectorStore,
) -> None:
# DIM=8 is not divisible by the PQ default sub-vectors; must not raise
# and must leave the table queryable (IVF_FLAT fallback or skipped).
for i in range(40):
store.add([_node(f"1-{i}", "1", f"t{i}", float(i))])
store.maybe_create_ann_index(min_rows=10)
result = store.query(
VectorStoreQuery(query_embedding=[1.0] * DIM, similarity_top_k=3),
)
assert len(result.nodes) == 3
def test_compact_reduces_to_single_version(
self,
store: PaperlessLanceVectorStore,
) -> None:
for i in range(5):
store.add([_node(f"1-{i}", "1", f"t{i}", float(i))])
assert len(store.client.open_table("documents").list_versions()) > 1
store.compact(retention_seconds=0)
assert len(store.client.open_table("documents").list_versions()) == 1
def test_upsert_after_optimize_with_scalar_index(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add(
[
_node("1-0", "1", "old0", 0.1),
_node("1-1", "1", "old1", 0.2),
_node("1-2", "1", "old2", 0.3),
_node("2-0", "2", "keep", 0.9),
],
)
store.ensure_document_id_scalar_index()
store.compact(retention_seconds=0)
store.upsert_document("1", [_node("1-0", "1", "new0", 0.1)])
table = store.client.open_table("documents")
doc1 = sorted(
r["id"] for r in table.search().where("document_id = '1'").to_list()
)
assert doc1 == ["1-0"]
assert table.count_rows() == 2
def test_ensure_scalar_index_is_idempotent(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.add([_node("1-0", "1", "text", 0.5)])
store.ensure_document_id_scalar_index()
# Second call must not raise and must not replace the existing index.
store.ensure_document_id_scalar_index()
assert store._has_index_on("document_id")
def test_ensure_scalar_index_noop_on_empty_store(
self,
store: PaperlessLanceVectorStore,
) -> None:
store.ensure_document_id_scalar_index() # no table yet — must not raise
class TestConfigMismatch:
@pytest.fixture
def uri(self, tmp_path: Path) -> str:
return str(tmp_path / "idx")
def test_stored_model_name_returns_none_when_no_table(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri)
assert store.stored_model_name() is None
def test_model_name_stored_in_schema_after_add(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri, embed_model_name="all-MiniLM-L6-v2")
store.add([_node("1-0", "1", "text", 0.1)])
assert store.stored_model_name() == "all-MiniLM-L6-v2"
def test_model_name_stored_in_schema_after_upsert(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri, embed_model_name="nomic-embed")
store.upsert_document("1", [_node("1-0", "1", "text", 0.1)])
assert store.stored_model_name() == "nomic-embed"
def test_model_name_persists_after_reopen(self, uri: str) -> None:
PaperlessLanceVectorStore(uri=uri, embed_model_name="all-MiniLM-L6-v2").add(
[_node("1-0", "1", "text", 0.1)],
)
reopened = PaperlessLanceVectorStore(uri=uri)
assert reopened.stored_model_name() == "all-MiniLM-L6-v2"
def test_config_mismatch_returns_false_when_no_table(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri)
assert store.config_mismatch("any-model") is False
def test_config_mismatch_returns_false_when_model_matches(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri, embed_model_name="all-MiniLM-L6-v2")
store.add([_node("1-0", "1", "text", 0.1)])
assert store.config_mismatch("all-MiniLM-L6-v2") is False
def test_config_mismatch_returns_true_when_model_differs(self, uri: str) -> None:
store = PaperlessLanceVectorStore(uri=uri, embed_model_name="old-model")
store.add([_node("1-0", "1", "text", 0.1)])
assert store.config_mismatch("new-model") is True
def test_config_mismatch_returns_false_when_no_metadata_stored(
self,
uri: str,
) -> None:
# Tables created before model-name tracking was added have no schema metadata.
# Conservative default: assume compatible rather than force a rebuild.
store = PaperlessLanceVectorStore(uri=uri)
store.add([_node("1-0", "1", "text", 0.1)])
assert store.config_mismatch("any-model") is False
with PaperlessSqliteVecVectorStore(uri=str(tmp_path)) as store: # no model name
store.add([make_node("a1", "1")])
assert not store.config_mismatch("model-a")
class TestGetModifiedTimes:
@pytest.fixture
def store(self, tmp_path: Path) -> PaperlessLanceVectorStore:
return PaperlessLanceVectorStore(uri=str(tmp_path / "idx"))
def _node_with_modified(
self,
node_id: str,
doc_id: str,
modified: str,
) -> TextNode:
node = TextNode(
id_=node_id,
text="text",
metadata={"document_id": doc_id, "modified": modified},
)
node.embedding = [0.1] * DIM
node.relationships = {
NodeRelationship.SOURCE: RelatedNodeInfo(node_id=doc_id),
}
return node
def test_empty_store_returns_empty_dict(
self,
store: PaperlessLanceVectorStore,
) -> None:
def test_empty_store_returns_empty_dict(self, store) -> None:
assert store.get_modified_times() == {}
def test_returns_one_entry_per_document(
self,
store: PaperlessLanceVectorStore,
) -> None:
def test_returns_one_entry_per_document(self, store) -> None:
store.add(
[
self._node_with_modified("1-0", "1", "2024-01-01T00:00:00"),
self._node_with_modified("1-1", "1", "2024-01-01T00:00:00"),
self._node_with_modified("2-0", "2", "2024-06-01T00:00:00"),
make_node("a1", "1", modified="2026-01-01T00:00:00"),
make_node("a2", "1", modified="2026-01-01T00:00:00"),
make_node("b1", "2", modified="2026-02-02T00:00:00"),
],
)
result = store.get_modified_times()
assert result == {
"1": "2024-01-01T00:00:00",
"2": "2024-06-01T00:00:00",
assert store.get_modified_times() == {
"1": "2026-01-01T00:00:00",
"2": "2026-02-02T00:00:00",
}
class TestCompact:
def _bloat_ratio(self, store) -> float:
live = store.client.execute(
"SELECT count(*) FROM documents",
).fetchone()[0]
# vec0 0.1.9 does not accumulate deleted rows in the _rowids shadow
# table, so we track cumulative inserts in index_meta instead.
row = store.client.execute(
"SELECT value FROM index_meta WHERE key = 'total_inserts'",
).fetchone()
total = int(row["value"]) if row else live
return total / max(live, 1)
def _churn(self, store, cycles: int) -> None:
for i in range(cycles):
store.upsert_document(
"1",
[make_node(f"gen{i}-{j}", "1", seed=float(j)) for j in range(20)],
)
def test_compact_noop_below_threshold(self, store) -> None:
store.add([make_node("a1", "1")])
store.compact()
assert _query(store, [0.0] * DIM, top_k=1).ids == ["a1"]
def test_force_compact_preserves_rows_and_metadata(self, store) -> None:
store.add([make_node("a1", "1"), make_node("b1", "2", seed=3.0)])
self._churn(store, 5)
before = {
n.node_id: n.metadata
for n in store.get_nodes(filters=_in_filter(["1", "2"]))
}
store.compact(force=True)
after = {
n.node_id: n.metadata
for n in store.get_nodes(filters=_in_filter(["1", "2"]))
}
assert after == before
assert self._bloat_ratio(store) == pytest.approx(1.0)
# store remains fully usable after the rebuild; use a seed far from all
# existing nodes (gen4-0..gen4-19 have seeds 0..19) so cosine KNN is
# unambiguous at top_k=1.
store.upsert_document("3", [make_node("c1", "3", seed=100.0)])
assert "c1" in _query(store, [100.0] * DIM, top_k=1).ids
def test_auto_compact_triggers_on_churn(self, store) -> None:
store.add([make_node(f"s{j}", "1", seed=float(j)) for j in range(20)])
self._churn(store, 5)
assert self._bloat_ratio(store) > 2
store.compact()
assert self._bloat_ratio(store) == pytest.approx(1.0)
def test_compact_on_missing_table_is_noop(self, store) -> None:
store.compact()
store.compact(force=True)
def test_failed_compact_removes_temp_wal_and_shm(
self,
store,
tmp_path: Path,
monkeypatch,
) -> None:
"""A compact() that raises mid-rebuild must leave no .compact* files.
Normally the sole connection's close() checkpoints the temp WAL away,
but a concurrent reader keeps -wal/-shm alive, so the cleanup must
unlink them explicitly (as the structural-migration path does).
"""
store.add([make_node("a1", "1")])
compact_path = str(tmp_path / DB_FILENAME) + ".compact"
held: list[sqlite3.Connection] = []
def boom(conn: sqlite3.Connection, dim: int) -> None:
# Hold an extra connection so close() of the rebuild connection is
# not the last one -> the temp -wal/-shm survive the checkpoint.
extra = sqlite3.connect(compact_path)
extra.execute("SELECT 1").fetchall()
held.append(extra)
raise RuntimeError("boom")
monkeypatch.setattr(
PaperlessSqliteVecVectorStore,
"_create_vec_table",
staticmethod(boom),
)
try:
with pytest.raises(RuntimeError):
store.compact(force=True)
assert sorted(p.name for p in tmp_path.glob("*.compact*")) == []
finally:
for c in held:
c.close()
class TestDbFile:
def test_single_db_file_in_index_dir(self, store, tmp_path: Path) -> None:
store.add([make_node("a1", "1")])
assert (tmp_path / DB_FILENAME).exists()
def test_wal_mode_enabled(self, store) -> None:
assert (
store.client.execute("PRAGMA journal_mode").fetchone()[0].lower() == "wal"
)
class TestMigrations:
"""Tests for the schema migration machinery."""
def _schema_version(self, store: PaperlessSqliteVecVectorStore) -> int | None:
row = store.client.execute(
"SELECT value FROM index_meta WHERE key = 'schema_version'",
).fetchone()
return int(row[0]) if row else None
def test_new_table_records_schema_version(self, store) -> None:
store.add([make_node("a1", "1")])
assert self._schema_version(store) == SCHEMA_VERSION
def test_check_migrations_no_table_returns_false(self, store) -> None:
assert store.check_and_run_migrations() is False
def test_check_migrations_current_version_returns_false(self, store) -> None:
store.add([make_node("a1", "1")])
assert store.check_and_run_migrations() is False
def test_reembed_migration_returns_true(self, store, tmp_path: Path) -> None:
store.add([make_node("a1", "1")])
migration = Migration(
from_version=1,
to_version=2,
kind="re-embed",
description="test re-embed",
)
MIGRATIONS.append(migration)
try:
from paperless_ai import vector_store as vs_mod
original = vs_mod.SCHEMA_VERSION
vs_mod.SCHEMA_VERSION = 2
result = store.check_and_run_migrations()
finally:
MIGRATIONS.remove(migration)
vs_mod.SCHEMA_VERSION = original
assert result is True
def test_structural_migration_copies_rows_and_updates_version(
self,
store,
tmp_path: Path,
) -> None:
store.add([make_node("a1", "1"), make_node("b1", "2")])
def apply(
src: sqlite3.Connection,
dst: sqlite3.Connection,
dim: int,
) -> None:
dst.execute( # nosemgrep
f"CREATE VIRTUAL TABLE {DEFAULT_TABLE_NAME} USING vec0("
"id TEXT PRIMARY KEY, document_id TEXT, modified TEXT,"
f" +node_content TEXT, embedding float[{dim}] distance_metric=cosine"
")",
)
dst.execute(
"INSERT INTO index_meta (key, value) VALUES ('dim', ?) "
"ON CONFLICT(key) DO UPDATE SET value = excluded.value",
(str(dim),),
)
rows = src.execute(
"SELECT id, document_id, modified, node_content, embedding "
f"FROM {DEFAULT_TABLE_NAME}",
).fetchall()
dst.execute("BEGIN IMMEDIATE")
dst.executemany(
f"INSERT INTO {DEFAULT_TABLE_NAME} "
"(id, document_id, modified, node_content, embedding) "
"VALUES (?, ?, ?, ?, ?)",
[
(
r["id"],
r["document_id"],
r["modified"],
r["node_content"],
bytes(r["embedding"]),
)
for r in rows
],
)
dst.execute(
"INSERT INTO index_meta (key, value) VALUES ('total_inserts', ?) "
"ON CONFLICT(key) DO UPDATE SET value = excluded.value",
(str(len(rows)),),
)
dst.execute("COMMIT")
migration = Migration(
from_version=1,
to_version=2,
kind="structural",
description="test structural",
apply=apply,
)
MIGRATIONS.append(migration)
try:
from paperless_ai import vector_store as vs_mod
original = vs_mod.SCHEMA_VERSION
vs_mod.SCHEMA_VERSION = 2
result = store.check_and_run_migrations()
finally:
MIGRATIONS.remove(migration)
vs_mod.SCHEMA_VERSION = original
assert result is False
assert self._schema_version(store) == 2
ids = {n.node_id for n in store.get_nodes()}
assert ids == {"a1", "b1"}
def test_compact_preserves_schema_version(self, store) -> None:
store.add([make_node("a1", "1")])
assert self._schema_version(store) == SCHEMA_VERSION
store.compact(force=True)
assert self._schema_version(store) == SCHEMA_VERSION
def test_stop_at_reembed_boundary(self, store) -> None:
# Registry: structural v2, re-embed v3, structural v4.
# Only v2 should apply; the re-embed boundary must stop execution
# before v4 runs, and the stored version must stay at 2.
store.add([make_node("a1", "1"), make_node("b1", "2")])
def copy_apply(
src: sqlite3.Connection,
dst: sqlite3.Connection,
dim: int,
) -> None:
dst.execute( # nosemgrep
f"CREATE VIRTUAL TABLE {DEFAULT_TABLE_NAME} USING vec0("
"id TEXT PRIMARY KEY, document_id TEXT, modified TEXT,"
f" +node_content TEXT, embedding float[{dim}] distance_metric=cosine"
")",
)
dst.execute(
"INSERT INTO index_meta (key, value) VALUES ('dim', ?) "
"ON CONFLICT(key) DO UPDATE SET value = excluded.value",
(str(dim),),
)
rows = src.execute(
"SELECT id, document_id, modified, node_content, embedding "
f"FROM {DEFAULT_TABLE_NAME}",
).fetchall()
dst.execute("BEGIN IMMEDIATE")
dst.executemany(
f"INSERT INTO {DEFAULT_TABLE_NAME} "
"(id, document_id, modified, node_content, embedding) "
"VALUES (?, ?, ?, ?, ?)",
[
(
r["id"],
r["document_id"],
r["modified"],
r["node_content"],
bytes(r["embedding"]),
)
for r in rows
],
)
dst.execute("COMMIT")
migrations = [
Migration(
from_version=1,
to_version=2,
kind="structural",
description="v2 structural",
apply=copy_apply,
),
Migration(
from_version=2,
to_version=3,
kind="re-embed",
description="v3 re-embed boundary",
),
Migration(
from_version=3,
to_version=4,
kind="structural",
description="v4 structural - must not run",
apply=copy_apply,
),
]
MIGRATIONS.extend(migrations)
try:
from paperless_ai import vector_store as vs_mod
original = vs_mod.SCHEMA_VERSION
vs_mod.SCHEMA_VERSION = 4
result = store.check_and_run_migrations()
finally:
for m in migrations:
MIGRATIONS.remove(m)
vs_mod.SCHEMA_VERSION = original
assert result is True
assert self._schema_version(store) == 2
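Review note: `test_stop_at_reembed_boundary` encodes the ordering rule for the migration registry: structural steps run in sequence, and the first re-embed step stops the run, leaves the stored version where it is, and reports that a rebuild is needed. A minimal sketch of such a runner is shown below. It is an illustration under assumptions, not the `check_and_run_migrations` in this PR: `SketchMigration` and `sketch_run_migrations` are hypothetical names, `apply` takes no arguments here (the real callback receives source and destination connections plus the dimension), and raising on a missing step is a choice made for the sketch.

```python
from collections.abc import Callable
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class SketchMigration:
    """Hypothetical, simplified counterpart of the Migration dataclass."""

    from_version: int
    to_version: int
    kind: Literal["structural", "re-embed"]
    description: str
    apply: Optional[Callable[[], None]] = None


def sketch_run_migrations(
    stored_version: int,
    target_version: int,
    migrations: list[SketchMigration],
) -> tuple[int, bool]:
    """Return (new_stored_version, needs_reembed)."""
    by_from = {m.from_version: m for m in migrations}
    version = stored_version
    while version < target_version:
        step = by_from.get(version)
        if step is None:
            raise RuntimeError(f"no migration registered from version {version}")
        if step.kind == "re-embed":
            # Stop here: later steps must not run against data that is
            # about to be rebuilt, and the stored version stays put.
            return version, True
        if step.apply is not None:
            step.apply()
        version = step.to_version
    return version, False
```

With a registry of structural v2, re-embed v3, structural v4 and a store at version 1, this sketch applies only the first step and returns `(2, True)`, which mirrors the assertions in the test above.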
@@ -0,0 +1,77 @@
from types import SimpleNamespace
import pytest
import pytest_mock
from django.contrib.auth.models import User
from rest_framework.test import APIClient
from documents.models import Document
from documents.tests.factories import DocumentFactory
@pytest.mark.django_db
class TestSuggestionsHintWiring:
@pytest.fixture
def document(self) -> Document:
return DocumentFactory() # type: ignore[return-value]
@pytest.fixture
def api_client(self, admin_user: User) -> APIClient:
client = APIClient()
client.force_authenticate(user=admin_user)
return client
def test_hints_passed_to_classifier_and_matchers(
self,
api_client: APIClient,
document: Document,
mocker: pytest_mock.MockerFixture,
) -> None:
hints = {
"tags": ["Bloodwork"],
"document_types": [],
"correspondents": [],
"storage_paths": [],
}
mocker.patch(
"documents.views.get_taxonomy_hints_for_document",
return_value=hints,
)
mocker.patch(
"documents.views.AIConfig",
return_value=SimpleNamespace(
ai_enabled=True,
llm_backend="ollama",
llm_output_language=None,
),
)
# No cached suggestion -> the view reaches the classifier path.
mocker.patch(
"documents.views.get_llm_suggestion_cache",
return_value=None,
)
mocker.patch("documents.views.set_llm_suggestions_cache")
classify = mocker.patch(
"documents.views.get_ai_document_classification",
return_value={
"title": "Doc",
"tags": ["Bloodwork"],
"correspondents": [],
"document_types": [],
"storage_paths": [],
"dates": [],
},
)
match_tags = mocker.patch(
"documents.views.match_tags_by_name",
return_value=[],
)
mocker.patch("documents.views.match_correspondents_by_name", return_value=[])
mocker.patch("documents.views.match_document_types_by_name", return_value=[])
mocker.patch("documents.views.match_storage_paths_by_name", return_value=[])
response = api_client.get(f"/api/documents/{document.pk}/ai_suggestions/")
assert response.status_code == 200
assert classify.call_args.kwargs["hints"] == hints
assert match_tags.call_args.kwargs["hinted_names"] == {"Bloodwork"}
+447 -176
@@ -1,15 +1,25 @@
import json
import logging
import sqlite3
import struct
from collections.abc import Callable
from collections.abc import Iterator
from collections.abc import Sequence
from contextlib import contextmanager
from dataclasses import dataclass
from dataclasses import field
from pathlib import Path
from types import TracebackType
from typing import Any
from typing import Literal
import lancedb
import pyarrow as pa
import sqlite_vec
from llama_index.core.bridge.pydantic import PrivateAttr
from llama_index.core.schema import BaseNode
from llama_index.core.vector_stores.types import BasePydanticVectorStore
from llama_index.core.vector_stores.types import FilterCondition
from llama_index.core.vector_stores.types import FilterOperator
from llama_index.core.vector_stores.types import MetadataFilter
from llama_index.core.vector_stores.types import MetadataFilters
from llama_index.core.vector_stores.types import VectorStoreQuery
from llama_index.core.vector_stores.types import VectorStoreQueryResult
@@ -18,46 +28,118 @@ from llama_index.core.vector_stores.utils import node_to_metadata_dict
logger = logging.getLogger("paperless_ai.vector_store")
DB_FILENAME = "llmindex.db"
DEFAULT_TABLE_NAME = "documents"
# Below this many chunks, LanceDB's exact (brute-force) search is sufficient and
# faster than building an ANN index (per LanceDB guidance, ~100K vectors).
ANN_INDEX_MIN_ROWS = 100_000
# IVF_PQ default; num_sub_vectors must evenly divide the embedding dimension.
ANN_PQ_SUB_VECTORS = 96
# Current schema version. Written to index_meta at table creation and bumped
# whenever a Migration is added to MIGRATIONS. check_and_run_migrations() uses
# this to decide which migrations to run on an existing store.
SCHEMA_VERSION = 1
# compact(): rebuild when the cumulative rowid count exceeds this multiple of
# the live row count. DELETEs on vec0 tables never reclaim space (upstream
# asg017/sqlite-vec#54), so per-document re-index churn grows the file until
# a rebuild copies the live rows into a fresh table.
COMPACT_BLOAT_RATIO = 2.0
# Filterable vec0 metadata columns. _build_where() only ever receives filter
# keys we construct ourselves, but allowlisting keeps SQL identifiers safe by
# construction.
_FILTER_COLUMNS = frozenset({"document_id", "modified"})
def _escape(value: str) -> str:
return str(value).replace("'", "''")
@dataclass
class Migration:
"""A schema migration for the sqlite-vec vector store.
kind="structural": rows are copied into a new-schema file with no
re-embedding needed. Supply ``apply(src_conn, dst_conn, dim)`` which
must create the vec0 table in ``dst_conn``, copy all rows from
``src_conn``, and write ``dim`` / ``embed_model`` / ``total_inserts`` to
``dst_conn``'s ``index_meta``. ``schema_version`` is written by the
migration runner after ``apply`` returns.
kind="re-embed": the new schema requires fresh embeddings.
``check_and_run_migrations()`` returns True when it encounters one of
these so the caller can force a full rebuild (which recreates the table
at the current SCHEMA_VERSION).
"""
from_version: int
to_version: int
kind: Literal["structural", "re-embed"]
description: str
apply: Callable[[sqlite3.Connection, sqlite3.Connection, int], None] | None = field(
default=None,
repr=False,
)
def _build_where(filters: MetadataFilters | None) -> str | None:
"""Translate the EQ / IN filters we use into a Lance SQL predicate on the
top-level ``document_id`` column."""
# Registry of all schema migrations in order. Empty at v1 -- this is the
# baseline. Add entries here (and bump SCHEMA_VERSION) when the schema changes.
MIGRATIONS: list[Migration] = []
def _pack(embedding: Sequence[float]) -> bytes:
return struct.pack(f"{len(embedding)}f", *embedding)
def _unpack(blob: bytes) -> list[float]:
return list(struct.unpack(f"{len(blob) // 4}f", blob))
def _build_where(filters: MetadataFilters | None) -> tuple[str, list[str]]:
"""Translate the EQ / IN filters we use into a parameterized SQL clause
on vec0 metadata columns. Returns ("", []) when there is nothing to filter.
"""
if filters is None or not filters.filters:
return None
return "", []
clauses: list[str] = []
params: list[str] = []
for f in filters.filters:
# filters.filters is Union[MetadataFilter, ExactMatchFilter, MetadataFilters];
# we only build MetadataFilter entries, so skip anything else at runtime.
if not isinstance(f, MetadataFilter):
continue
if f.key not in _FILTER_COLUMNS: # pragma: no cover - we build the keys
raise NotImplementedError(f"Unsupported filter column: {f.key}")
if f.operator == FilterOperator.IN:
vals = ",".join(f"'{_escape(v)}'" for v in f.value)
clauses.append(f"{f.key} IN ({vals})")
values = [str(v) for v in f.value] # type: ignore[union-attr] # value is list when operator is IN
if not values: # pragma: no cover
clauses.append("1 = 0")
continue
placeholders = ",".join("?" for _ in values)
clauses.append(f"{f.key} IN ({placeholders})")
params.extend(values)
elif f.operator == FilterOperator.EQ:
clauses.append(f"{f.key} = '{_escape(f.value)}'")
clauses.append(f"{f.key} = ?")
params.append(str(f.value))
else: # pragma: no cover - we only ever build EQ/IN filters
raise NotImplementedError(f"Unsupported filter operator: {f.operator}")
if not clauses:
# Filters were requested but none could be translated. Fail closed
# rather than emit "()" (invalid SQL): filters scope document access,
# so an empty translation must match no rows, never widen the scope.
return "1 = 0", []
joiner = " OR " if filters.condition == FilterCondition.OR else " AND "
return joiner.join(clauses)
return "(" + joiner.join(clauses) + ")", params
class PaperlessLanceVectorStore(BasePydanticVectorStore):
"""A llama-index vector store backed directly by a LanceDB table.
class PaperlessSqliteVecVectorStore(BasePydanticVectorStore):
"""A llama-index vector store backed by a sqlite-vec vec0 table.
Stores one row per node with the node id, its document id (both as the
``ref_doc_id`` delete key ``doc_id`` and a top-level filter column
``document_id``), the embedding, and the serialised node (text + metadata)
as JSON. ``stores_text`` lets llama-index run off this store alone, with no
Stores one row per node: the node id (TEXT primary key), its document id
(metadata column, used for EQ/IN filtering and per-document delete), the
document's modified timestamp, the embedding (float32, cosine metric), and
the serialized node (text + metadata) as JSON in an auxiliary column.
``stores_text`` lets llama-index run off this store alone, with no
separate docstore or index store.
Everything lives in one SQLite database file (``DB_FILENAME``) inside the
directory given as ``uri`` (kept as a directory for compatibility with the
previous LanceDB layout). WAL mode allows readers in other processes to
proceed while the (FileLock-serialized) writer holds a transaction.
Implemented surface of ``BasePydanticVectorStore``
---------------------------------------------------
Only the methods actively used by this codebase are implemented.
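Review note: the parameterized form of `_build_where` above can be exercised without llama-index or sqlite-vec. The following is a minimal sketch, assuming a plain SQLite table named `chunks` and a hypothetical helper `in_clause` (neither is part of this diff); it only illustrates the allowlist, placeholder, and fail-closed behavior described in the hunk.

```python
import sqlite3

# Hypothetical allowlist mirroring the idea of _FILTER_COLUMNS in the diff.
ALLOWED_COLUMNS = frozenset({"document_id", "modified"})


def in_clause(column: str, values: list) -> tuple[str, list[str]]:
    """Build a parameterized IN predicate; an empty value list matches no rows."""
    if column not in ALLOWED_COLUMNS:
        raise ValueError(f"Unsupported filter column: {column}")
    params = [str(v) for v in values]
    if not params:
        # Fail closed: an empty filter must never widen the result set.
        return "1 = 0", []
    placeholders = ",".join("?" for _ in params)
    return f"{column} IN ({placeholders})", params


# Plain table standing in for the vec0 table (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE chunks (id TEXT PRIMARY KEY, document_id TEXT)")
conn.executemany(
    "INSERT INTO chunks VALUES (?, ?)",
    [("a", "1"), ("b", "2"), ("c", "3")],
)


def select_ids(values: list) -> list[str]:
    where, params = in_clause("document_id", values)
    sql = "SELECT id FROM chunks WHERE " + where + " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]


visible = select_ids([1, 3])
hostile = select_ids(["1' OR '1'='1"])  # bound as data, not parsed as SQL
nothing = select_ids([])
```

Because the values travel as bound parameters, the quote-laden string is compared literally and matches nothing, which is the property the old `_escape`-based string building had to approximate by hand.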
@@ -70,58 +152,117 @@ class PaperlessLanceVectorStore(BasePydanticVectorStore):
flat_metadata: bool = False
_uri: str = PrivateAttr()
_table_name: str = PrivateAttr()
_embed_model_name: str | None = PrivateAttr()
_conn: Any = PrivateAttr()
_table: Any = PrivateAttr()
def __init__(
self,
uri: str,
table_name: str = DEFAULT_TABLE_NAME,
embed_model_name: str | None = None,
) -> None:
super().__init__(stores_text=True, flat_metadata=False)
self._uri = uri
self._table_name = table_name
self._embed_model_name = embed_model_name
self._conn = lancedb.connect(uri)
existing = self._conn.list_tables().tables
self._table = (
self._conn.open_table(table_name) if table_name in existing else None
self._conn = self._open_connection(str(Path(uri) / DB_FILENAME))
@staticmethod
def _open_connection(db_path: str) -> sqlite3.Connection:
conn = sqlite3.connect(
db_path,
timeout=30,
isolation_level=None, # autocommit; explicit transactions below
)
conn.row_factory = sqlite3.Row
conn.enable_load_extension(True) # noqa: FBT003
sqlite_vec.load(conn)
conn.enable_load_extension(False) # noqa: FBT003
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("PRAGMA synchronous=NORMAL")
conn.execute(
"CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT)",
)
return conn
@property
def client(self) -> Any:
return self._conn
def close(self) -> None:
"""Close the underlying SQLite connection (idempotent)."""
self._conn.close()
def __enter__(self) -> "PaperlessSqliteVecVectorStore":
return self
def __exit__(
self,
exc_type: type[BaseException] | None,
exc_val: BaseException | None,
exc_tb: TracebackType | None,
) -> None:
# Deterministically release the connection (and its WAL/SHM handles) so
# it is never left open across a compaction/migration file swap.
self.close()
@contextmanager
def _transaction(self) -> Iterator[None]:
self._conn.execute("BEGIN IMMEDIATE")
try:
yield
except BaseException: # pragma: no cover
self._conn.execute("ROLLBACK")
raise
else:
self._conn.execute("COMMIT")
def _meta_get(self, key: str) -> str | None:
row = self._conn.execute(
"SELECT value FROM index_meta WHERE key = ?",
(key,),
).fetchone()
return row["value"] if row else None
@staticmethod
def _meta_set_on(conn: sqlite3.Connection, key: str, value: str) -> None:
conn.execute(
"INSERT INTO index_meta (key, value) VALUES (?, ?) "
"ON CONFLICT(key) DO UPDATE SET value = excluded.value",
(key, value),
)
def _meta_set(self, key: str, value: str) -> None:
self._meta_set_on(self._conn, key, value)
def table_exists(self) -> bool:
return self._table is not None
return (
self._conn.execute(
"SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
(DEFAULT_TABLE_NAME,),
).fetchone()
is not None
)
def vector_dim(self) -> int | None:
if self._table is None:
if not self.table_exists():
return None
return self._table.schema.field("vector").type.list_size
value = self._meta_get("dim")
return int(value) if value else None
def drop_table(self) -> None:
if self.table_exists():
self._conn.drop_table(self._table_name)
self._table = None
self._conn.execute("DROP TABLE IF EXISTS " + DEFAULT_TABLE_NAME)
self._conn.execute("DELETE FROM index_meta")
def stored_model_name(self) -> str | None:
"""Return the embedding model name stored in table schema metadata, or None."""
if self._table is None:
"""Return the embedding model name recorded at table creation, or None."""
if not self.table_exists():
return None
meta = self._table.schema.metadata or {}
value = meta.get(b"embed_model")
return value.decode() if value else None
return self._meta_get("embed_model")
def config_mismatch(self, model_name: str) -> bool:
"""True when the stored model name differs from ``model_name``.
Returns False when no table exists or when the table predates model-name
tracking (schema has no metadata) — conservative default avoids spurious
rebuilds on upgrade.
Returns False when no table exists or when the table predates
model-name tracking — conservative default avoids spurious rebuilds.
"""
stored = self.stored_model_name()
if stored is None:
@@ -129,97 +270,115 @@ class PaperlessLanceVectorStore(BasePydanticVectorStore):
return stored != model_name
@staticmethod
def _schema(dim: int, model_name: str | None = None) -> pa.Schema:
meta = {b"embed_model": model_name.encode()} if model_name else None
return pa.schema(
[
pa.field("id", pa.string()),
pa.field("doc_id", pa.string()),
pa.field("document_id", pa.string()),
pa.field("modified", pa.string()),
pa.field("vector", pa.list_(pa.float32(), dim)),
pa.field("node_content", pa.string()),
],
metadata=meta,
def _create_vec_table(conn: sqlite3.Connection, dim: int) -> None:
# document_id is deliberately a metadata column, NOT a partition key:
# partition keys change KNN `k` to per-partition semantics under IN
# filters (asg017/sqlite-vec#142); metadata columns give a correct
# global top-k.
conn.execute( # nosemgrep: python.sqlalchemy.security.sqlalchemy-execute-raw-query.sqlalchemy-execute-raw-query
"CREATE VIRTUAL TABLE "
+ DEFAULT_TABLE_NAME
+ " USING vec0("
+ "id TEXT PRIMARY KEY,"
+ " document_id TEXT,"
+ " modified TEXT,"
+ " +node_content TEXT,"
+ " embedding float["
+ str(int(dim))
+ "] distance_metric=cosine"
+ ")",
)
def _row(self, node: BaseNode) -> dict[str, Any]:
def _create_table(self, dim: int) -> None:
self._create_vec_table(self._conn, dim)
self._meta_set("dim", str(dim))
self._meta_set("schema_version", str(SCHEMA_VERSION))
if self._embed_model_name:
self._meta_set("embed_model", self._embed_model_name)
def _ensure_table(self, dim: int) -> None:
if not self.table_exists():
self._create_table(dim)
def _row(self, node: BaseNode) -> tuple[str, str, str, str, bytes]:
meta = node_to_metadata_dict(
node,
remove_text=False,
flat_metadata=self.flat_metadata,
)
return {
"id": node.node_id,
"doc_id": node.ref_doc_id,
"document_id": str(node.metadata.get("document_id")),
"modified": str(node.metadata.get("modified", "")),
"vector": node.get_embedding(),
"node_content": json.dumps(meta),
}
def _ensure_table(self, rows: list[dict[str, Any]], dim: int) -> bool:
"""Create the table from ``rows`` if it does not exist yet.
Returns True if the table was just created (caller can skip the
separate add/merge step), False if the table already existed.
"""
if self._table is not None:
return False
self._table = self._conn.create_table(
self._table_name,
rows,
schema=self._schema(dim, self._embed_model_name),
# vec0 metadata columns reject NULL (asg017/sqlite-vec#141): coerce
# every value to a string, with "" as the absent sentinel.
document_id = node.ref_doc_id or node.metadata.get("document_id")
return (
node.node_id,
str(document_id or ""),
str(node.metadata.get("modified") or ""),
json.dumps(meta),
_pack(node.get_embedding()),
)
return True
_INSERT = (
"INSERT INTO "
+ DEFAULT_TABLE_NAME
+ " (id, document_id, modified, node_content, embedding) VALUES (?, ?, ?, ?, ?)"
)
def _increment_total_inserts(self, count: int) -> None:
"""Increment the cumulative insert counter stored in index_meta.
This counter never decreases (DELETEs do not decrement it) and is
used by compact() to estimate the bloat ratio: when total_inserts /
live_rows exceeds COMPACT_BLOAT_RATIO the table has accumulated
enough deleted-but-not-freed rows to warrant a rebuild.
"""
current = int(self._meta_get("total_inserts") or "0")
self._meta_set("total_inserts", str(current + count))
def add(self, nodes: Sequence[BaseNode], **add_kwargs: Any) -> list[str]:
if not nodes:
return []
rows = [self._row(node) for node in nodes]
dim = len(nodes[0].get_embedding())
if not self._ensure_table(rows, dim):
self._table.add(rows)
with self._transaction():
self._ensure_table(len(nodes[0].get_embedding()))
self._conn.executemany(self._INSERT, rows)
self._increment_total_inserts(len(rows))
return [node.node_id for node in nodes]
def upsert_document(self, document_id: str, nodes: list[BaseNode]) -> list[str]:
"""Atomically replace all stored chunks of ``document_id`` with ``nodes``.
A single ``merge_insert`` commit: matching node ids are updated, new ids
inserted, and any existing rows for this document that are not in the new
set are deleted (``when_not_matched_by_source_delete``). This prunes stale
trailing chunks when an edit reduces a document's chunk count, with no
transient empty state for concurrent lock-free readers.
One transaction deletes the document's existing rows and inserts the
new set (vec0's INSERT OR REPLACE is broken upstream,
asg017/sqlite-vec#259, so the replacement is an explicit delete+insert).
WAL readers in other processes see either the old or the new chunk set,
never a partial state.
"""
if not nodes:
# No indexable content: remove any existing chunks for this document.
if self._table is not None:
self._table.delete(f"document_id = '{_escape(document_id)}'")
return []
rows = [self._row(node) for node in nodes]
dim = len(nodes[0].get_embedding())
if self._ensure_table(rows, dim):
return [node.node_id for node in nodes]
(
self._table.merge_insert("id")
.when_matched_update_all()
.when_not_matched_insert_all()
.when_not_matched_by_source_delete(
f"document_id = '{_escape(document_id)}'",
)
.execute(rows)
)
with self._transaction():
if nodes:
self._ensure_table(len(nodes[0].get_embedding()))
if self.table_exists():
self._conn.execute(
"DELETE FROM " + DEFAULT_TABLE_NAME + " WHERE document_id = ?",
(str(document_id),),
)
if rows:
self._conn.executemany(self._INSERT, rows)
self._increment_total_inserts(len(rows))
return [node.node_id for node in nodes]
def delete(self, ref_doc_id: str, **delete_kwargs: Any) -> None:
if self._table is not None:
self._table.delete(f"doc_id = '{_escape(ref_doc_id)}'")
if self.table_exists():
with self._transaction():
self._conn.execute(
"DELETE FROM " + DEFAULT_TABLE_NAME + " WHERE document_id = ?",
(str(ref_doc_id),),
)
def _rows_to_nodes(self, rows: list[dict[str, Any]]) -> list[BaseNode]:
def _rows_to_nodes(self, rows: list[sqlite3.Row]) -> list[BaseNode]:
nodes: list[BaseNode] = []
for row in rows:
node = metadata_dict_to_node(json.loads(row["node_content"]))
node.embedding = list(row["vector"])
node.embedding = _unpack(row["embedding"])
nodes.append(node)
return nodes
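Review note: the delete-then-insert replacement described in the `upsert_document` docstring can be sketched with the standard library alone. The example assumes an ordinary SQLite table named `chunks` in place of the vec0 table, and the `transaction` and `upsert_document` functions here are simplified illustrations rather than the code from the diff.

```python
import sqlite3
from collections.abc import Iterator
from contextlib import contextmanager


@contextmanager
def transaction(conn: sqlite3.Connection) -> Iterator[None]:
    # Take the write lock up front so the delete and insert commit together.
    conn.execute("BEGIN IMMEDIATE")
    try:
        yield
    except BaseException:
        conn.execute("ROLLBACK")
        raise
    else:
        conn.execute("COMMIT")


def upsert_document(
    conn: sqlite3.Connection,
    document_id: str,
    chunks: list[str],
) -> None:
    """Replace every stored chunk of document_id with the given chunks."""
    with transaction(conn):
        conn.execute("DELETE FROM chunks WHERE document_id = ?", (document_id,))
        conn.executemany(
            "INSERT INTO chunks (id, document_id, body) VALUES (?, ?, ?)",
            [
                (f"{document_id}-{i}", document_id, body)
                for i, body in enumerate(chunks)
            ],
        )


# isolation_level=None means autocommit, so BEGIN/COMMIT are issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE chunks (id TEXT PRIMARY KEY, document_id TEXT, body TEXT)",
)
upsert_document(conn, "7", ["a", "b", "c"])
upsert_document(conn, "7", ["a2"])  # the edit shrank the document to one chunk
remaining = [row[0] for row in conn.execute("SELECT body FROM chunks ORDER BY id")]
```

The second call leaves only the new chunk behind, which is the stale-trailing-chunk pruning that the LanceDB `merge_insert` path previously handled.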
@@ -232,102 +391,214 @@ class PaperlessLanceVectorStore(BasePydanticVectorStore):
if node_ids is not None: # pragma: no cover
# node_ids lookup is not implemented; see class docstring.
raise NotImplementedError(
"PaperlessLanceVectorStore does not support node_ids lookup",
"PaperlessSqliteVecVectorStore does not support node_ids lookup",
)
if self._table is None:
if not self.table_exists():
return []
where = _build_where(filters)
query = self._table.search()
where, params = _build_where(filters)
sql = "SELECT node_content, embedding FROM " + DEFAULT_TABLE_NAME
if where:
query = query.where(where)
return self._rows_to_nodes(query.to_list())
sql += " WHERE " + where
return self._rows_to_nodes(self._conn.execute(sql, params).fetchall())
def query(
self,
query: VectorStoreQuery,
**kwargs: Any,
) -> VectorStoreQueryResult:
if self._table is None:
if not self.table_exists():
return VectorStoreQueryResult(nodes=[], similarities=[], ids=[])
if query.query_embedding is None: # pragma: no cover
return VectorStoreQueryResult(nodes=[], similarities=[], ids=[])
top_k = query.similarity_top_k if query.similarity_top_k is not None else 10
search = self._table.search(query.query_embedding).limit(top_k)
where = _build_where(query.filters)
where, params = _build_where(query.filters)
sql = (
"SELECT id, node_content, embedding, distance FROM "
+ DEFAULT_TABLE_NAME
+ " WHERE embedding MATCH ? AND k = ?"
)
if where:
search = search.where(where)
rows = search.to_list()
sql += " AND " + where
rows = self._conn.execute(
sql,
[_pack(query.query_embedding), top_k, *params],
).fetchall()
# vec0 returns rows distance-sorted ascending; slice defensively in
# case future schema changes alter k semantics (e.g. partition keys
# return k rows per partition).
rows = rows[:top_k]
nodes = self._rows_to_nodes(rows)
# LanceDB returns an L2 distance (smaller = closer); map to a descending similarity.
sims = [1.0 / (1.0 + float(row["_distance"])) for row in rows]
# Cosine distance in [0, 2]; map to a descending similarity in [-1, 1].
# vec0 returns None distance when the query embedding is the zero vector
# (no meaningful cosine angle); treat that as distance 1.0, i.e. a neutral
# similarity of 0.0, so the row is still included in the result.
sims = [
1.0 - float(row["distance"] if row["distance"] is not None else 1.0)
for row in rows
]
ids = [row["id"] for row in rows]
return VectorStoreQueryResult(nodes=nodes, similarities=sims, ids=ids)
def _has_index_on(self, column: str) -> bool:
return any(column in idx.columns for idx in self._table.list_indices())
def maybe_create_ann_index(self, min_rows: int = ANN_INDEX_MIN_ROWS) -> None:
"""Best-effort: build an IVF index once the table is large enough.
IVF_PQ is used when ``num_sub_vectors`` divides the embedding dimension,
otherwise IVF_FLAT (no divisor constraint). Any failure is logged and
leaves the table on exact search, which is always correct.
"""
if self._table is None:
return
rows = self._table.count_rows()
if rows < min_rows or self._has_index_on("vector"):
return
num_partitions = max(1, rows // 4096)
# Embedding dim from the schema's fixed-size list column.
dim = self._table.schema.field("vector").type.list_size
try:
if dim % ANN_PQ_SUB_VECTORS == 0: # pragma: no cover
self._table.create_index(
metric="l2",
num_partitions=num_partitions,
num_sub_vectors=ANN_PQ_SUB_VECTORS,
index_type="IVF_PQ",
)
else:
self._table.create_index(
metric="l2",
num_partitions=num_partitions,
index_type="IVF_FLAT",
)
except Exception as e: # pragma: no cover - depends on data/dim
logger.warning("Skipping ANN index creation: %s", e)
def get_modified_times(self) -> dict[str, str]:
"""Return {document_id: stored_modified_isoformat} for all indexed documents.
One representative chunk per document is fetched; all chunks share the
same ``modified`` value so the first one seen is sufficient.
All chunks of a document share the same ``modified`` value, so the
first row seen per document is sufficient.
"""
if self._table is None:
if not self.table_exists():
return {}
result: dict[str, str] = {}
for row in self._table.search().select(["document_id", "modified"]).to_list():
for row in self._conn.execute(
"SELECT document_id, modified FROM " + DEFAULT_TABLE_NAME,
):
doc_id = str(row["document_id"])
if doc_id not in result:
result[doc_id] = str(row["modified"] or "")
return result
def ensure_document_id_scalar_index(self) -> None:
"""Create a scalar index on the filter column (never on the merge key
``id`` — see https://github.com/lancedb/lancedb/issues/3177).
No-op if the index already exists."""
if self._table is None:
def compact(self, *, force: bool = False) -> None:
"""Rebuild the database file to reclaim space left behind by DELETEs.
vec0 DELETE only invalidates rows; the vector data stays in the file
forever (asg017/sqlite-vec#54), and per-document re-indexing is a
delete+insert. The cumulative insert counter in ``index_meta`` tracks
total rows ever written; when that exceeds ``COMPACT_BLOAT_RATIO`` x
the live row count (or when forced), live rows are copied into a fresh
database file and swapped in via ``Path.replace``.
Note: ``ALTER TABLE ... RENAME TO`` on vec0 virtual tables does NOT
rename the shadow tables (sqlite-vec upstream limitation), so
an in-place rename-based rebuild is not safe. The file-swap approach
is the maintainer-endorsed workaround (asg017/sqlite-vec#205).
"""
if not self.table_exists():
return
if self._has_index_on("document_id"):
live = self._conn.execute(
"SELECT count(*) FROM " + DEFAULT_TABLE_NAME,
).fetchone()[0]
total = int(self._meta_get("total_inserts") or str(live))
if not force and total <= max(live, 1) * COMPACT_BLOAT_RATIO:
return
dim = self.vector_dim()
if dim is None: # pragma: no cover - dim is written at creation
logger.warning("Skipping compact: no stored vector dimension")
return
logger.info(
"Compacting LLM index (%d live rows, %d cumulative inserts)",
live,
total,
)
db_path = str(Path(self._uri) / DB_FILENAME)
compact_path = db_path + ".compact"
# Copy all live rows into a fresh database file.
new_conn = self._open_connection(compact_path)
try:
self._table.create_scalar_index("document_id")
except Exception as e: # pragma: no cover
logger.warning("Skipping document_id scalar index: %s", e)
self._create_vec_table(new_conn, dim)
self._meta_set_on(new_conn, "dim", str(dim))
for key in ("embed_model", "schema_version"):
value = self._meta_get(key)
if value is not None:
self._meta_set_on(new_conn, key, value)
rows = self._conn.execute(
"SELECT id, document_id, modified, node_content, embedding "
"FROM " + DEFAULT_TABLE_NAME,
).fetchall()
new_conn.execute("BEGIN IMMEDIATE")
new_conn.executemany(
self._INSERT,
[
(
r["id"],
r["document_id"],
r["modified"],
r["node_content"],
bytes(r["embedding"]),
)
for r in rows
],
)
# Reset the cumulative counter: after compact, total_inserts == live.
self._meta_set_on(new_conn, "total_inserts", str(live))
new_conn.execute("COMMIT")
except BaseException:
new_conn.close()
for p in [compact_path, compact_path + "-wal", compact_path + "-shm"]:
Path(p).unlink(missing_ok=True)
raise
new_conn.close()
self._swap_in_compact(compact_path, db_path)
def compact(self, retention_seconds: int) -> None:
"""Compact fragments and prune old MVCC versions in one call."""
if self._table is None:
return
from datetime import timedelta
def _swap_in_compact(self, compact_path: str, db_path: str) -> None:
"""Atomically replace the live database with the compacted copy."""
self._conn.close()
for suffix in ["-wal", "-shm"]:
stale = Path(compact_path + suffix)
if stale.exists(): # pragma: no cover
stale.unlink()
Path(compact_path).replace(db_path)
self._conn = self._open_connection(db_path)
self._table.optimize(cleanup_older_than=timedelta(seconds=retention_seconds))
def check_and_run_migrations(self) -> bool:
"""Apply any pending schema migrations to the store.
Structural migrations copy live rows into a new-schema file with no
re-embedding. Re-embed migrations cannot be applied automatically;
this method returns True when one is encountered so the caller can
force a full rebuild (which recreates the table at SCHEMA_VERSION).
Must be called under the write FileLock. No-op when the table does
not exist or is already at SCHEMA_VERSION.
"""
if not self.table_exists():
return False
raw = self._meta_get("schema_version")
current = int(raw) if raw is not None else SCHEMA_VERSION
if current >= SCHEMA_VERSION:
return False
pending = sorted(
[m for m in MIGRATIONS if current <= m.from_version < SCHEMA_VERSION],
key=lambda m: m.from_version,
)
for migration in pending:
if migration.kind == "re-embed":
logger.warning(
"LLM index schema v%d -> v%d requires re-embedding (%s); "
"forcing full rebuild.",
migration.from_version,
migration.to_version,
migration.description,
)
return True
logger.info(
"Running structural LLM index migration v%d -> v%d: %s",
migration.from_version,
migration.to_version,
migration.description,
)
self._run_structural_migration(migration)
return False
def _run_structural_migration(self, migration: Migration) -> None:
"""Execute a structural migration using the same file-swap as compact()."""
assert migration.apply is not None, "structural migration must have apply()"
dim = self.vector_dim()
if dim is None: # pragma: no cover
raise RuntimeError("Cannot migrate: no stored vector dimension")
db_path = str(Path(self._uri) / DB_FILENAME)
compact_path = db_path + ".compact"
new_conn = self._open_connection(compact_path)
try:
migration.apply(self._conn, new_conn, dim)
self._meta_set_on(new_conn, "schema_version", str(migration.to_version))
except BaseException: # pragma: no cover
new_conn.close()
for p in [compact_path, compact_path + "-wal", compact_path + "-shm"]:
Path(p).unlink(missing_ok=True)
raise
new_conn.close()
self._swap_in_compact(compact_path, db_path)
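Review note: the bloat check at the top of `compact()` reduces to a small pure function, which makes the threshold easy to reason about. This is a sketch under the assumption that the decision depends only on the live row count, the cumulative insert counter, and the `force` flag; `should_compact` is a hypothetical name, not part of the diff.

```python
COMPACT_BLOAT_RATIO = 2.0


def should_compact(
    live: int,
    total_inserts: int,
    *,
    force: bool = False,
    ratio: float = COMPACT_BLOAT_RATIO,
) -> bool:
    """Return True when a rebuild is warranted.

    max(live, 1) keeps the threshold meaningful for an emptied table, where
    live is 0 but deleted rows may still occupy space in the file.
    """
    if force:
        return True
    return total_inserts > max(live, 1) * ratio


# 100 live rows after 200 cumulative inserts sits exactly on the threshold.
at_threshold = should_compact(100, 200)
over_threshold = should_compact(100, 201)
emptied_table = should_compact(0, 3)
```

With the default ratio, a store is rebuilt once more than half of the rows ever written have been deleted or replaced.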
+10 -5
@@ -4,6 +4,7 @@ import logging
import ssl
import tempfile
import traceback
import unicodedata
from datetime import date
from datetime import timedelta
from fnmatch import fnmatch
@@ -496,10 +497,10 @@ class MailAccountHandler(LoggingMixin):
rule: MailRule,
) -> str | None:
if rule.assign_title_from == MailRule.TitleSource.FROM_SUBJECT:
return message.subject
return unicodedata.normalize("NFC", message.subject)
elif rule.assign_title_from == MailRule.TitleSource.FROM_FILENAME:
return Path(att.filename).stem
return unicodedata.normalize("NFC", Path(att.filename).stem)
elif rule.assign_title_from == MailRule.TitleSource.NONE:
return None
@@ -866,7 +867,9 @@ class MailAccountHandler(LoggingMixin):
),
)
attachment_name = pathvalidate.sanitize_filename(att.filename)
attachment_name = pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", att.filename),
)
if attachment_name:
temp_filename = temp_dir / attachment_name
else: # pragma: no cover
@@ -882,7 +885,7 @@ class MailAccountHandler(LoggingMixin):
)
doc_overrides = DocumentMetadataOverrides(
title=title,
filename=pathvalidate.sanitize_filename(att.filename),
filename=attachment_name,
correspondent_id=correspondent.id if correspondent else None,
document_type_id=doc_type.id if doc_type else None,
tag_ids=tag_ids,
@@ -988,7 +991,9 @@ class MailAccountHandler(LoggingMixin):
)
doc_overrides = DocumentMetadataOverrides(
title=message.subject,
filename=pathvalidate.sanitize_filename(f"{message.subject}.eml"),
filename=pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", f"{message.subject}.eml"),
),
correspondent_id=correspondent.id if correspondent else None,
document_type_id=doc_type.id if doc_type else None,
tag_ids=tag_ids,
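Review note: the reason for the `unicodedata.normalize("NFC", ...)` calls in this file is that the same visible filename can arrive as different code point sequences. A minimal sketch using only the standard library (the filename is an illustrative value, written with an escape so the source encoding cannot affect it):

```python
import unicodedata

# "Rechnung März.pdf" with a precomposed a-umlaut (U+00E4).
composed = "Rechnung M\u00e4rz.pdf"

# NFD splits the umlaut into "a" plus a combining diaeresis (U+0308).
nfd = unicodedata.normalize("NFD", composed)

# NFC recombines the pair into the single precomposed code point.
nfc = unicodedata.normalize("NFC", nfd)

same_text = nfc == composed
differs_before_normalizing = nfd != composed
extra_code_points = len(nfd) - len(nfc)
```

Without normalization the two forms compare unequal and produce distinct filenames on disk, even though they render identically.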
+182
@@ -0,0 +1,182 @@
"""
Tests that mail attachment filenames and EML subject filenames are
normalized to NFC Unicode before being stored as document overrides.
Filenames from MIME headers can arrive in NFD form (e.g. from macOS Mail),
and must be normalized to NFC so filenames are consistent regardless of the
sending client.
"""
import unicodedata
from pathlib import Path
from unittest import mock
import pytest
from documents.tests.utils import remove_dirs
from documents.tests.utils import setup_directories
from paperless_mail.models import MailRule
from paperless_mail.tests.factories import MailAccountFactory
from paperless_mail.tests.test_mail import MessageBuilder
from paperless_mail.tests.test_mail import _AttachmentDef
from paperless_mail.tests.test_mail import fake_magic_from_buffer
@pytest.fixture()
def directories(settings):
dirs = setup_directories()
yield dirs
remove_dirs(dirs)
@pytest.fixture()
def queue_consumption_tasks_mock():
with mock.patch("paperless_mail.mail.queue_consumption_tasks") as m:
yield m
@pytest.fixture()
def mail_account(db):
return MailAccountFactory()
@pytest.fixture()
def attachment_rule(mail_account):
rule = MailRule(
name="attachment rule",
account=mail_account,
assign_title_from=MailRule.TitleSource.FROM_FILENAME,
consumption_scope=MailRule.ConsumptionScope.ATTACHMENTS_ONLY,
attachment_type=MailRule.AttachmentProcessing.ATTACHMENTS_ONLY,
)
rule.save()
return rule
@pytest.fixture()
def eml_rule(mail_account):
rule = MailRule(
name="eml rule",
account=mail_account,
assign_title_from=MailRule.TitleSource.FROM_SUBJECT,
consumption_scope=MailRule.ConsumptionScope.EML_ONLY,
attachment_type=MailRule.AttachmentProcessing.ATTACHMENTS_ONLY,
)
rule.save()
return rule
@pytest.fixture()
def message_builder():
return MessageBuilder()

@pytest.mark.django_db
@mock.patch("paperless_mail.mail.magic.from_buffer", fake_magic_from_buffer)
class TestMailNFCNormalization:
    """Attachment filenames and EML subject filenames must be NFC-normalized."""

    def test_attachment_nfd_filename_normalized_to_nfc(
        self,
        directories,
        queue_consumption_tasks_mock,
        attachment_rule,
        mail_account_handler,
        message_builder,
    ):
        """Attachment filename arriving as NFD must be stored as NFC in both
        the overrides and the temp file written to disk.
        """
        nfd_filename = unicodedata.normalize("NFD", "Rechnung März.pdf")
        nfc_filename = unicodedata.normalize("NFC", "Rechnung März.pdf")
        # Confirm the fixture is actually NFD (not already NFC)
        assert unicodedata.is_normalized("NFD", nfd_filename)
        assert not unicodedata.is_normalized("NFC", nfd_filename)
        message = message_builder.create_message(
            subject="Test invoice",
            from_="sender@example.com",
            attachments=[
                _AttachmentDef(filename=nfd_filename, content=b"%PDF-1.4 test"),
            ],
        )
        result = mail_account_handler._handle_message(message, attachment_rule)
        assert result == 1
        queue_consumption_tasks_mock.assert_called_once()
        call_kwargs = queue_consumption_tasks_mock.call_args.kwargs
        consume_tasks = call_kwargs["consume_tasks"]
        assert len(consume_tasks) == 1
        overrides = consume_tasks[0].kwargs["overrides"]
        assert overrides.filename == nfc_filename
        assert unicodedata.is_normalized("NFC", overrides.filename)
        assert unicodedata.is_normalized("NFC", overrides.title)
        input_doc = consume_tasks[0].kwargs["input_doc"]
        original_file = Path(input_doc.original_file)
        assert original_file.exists()
        assert original_file.name == nfc_filename

    def test_eml_subject_filename_nfc(
        self,
        directories,
        queue_consumption_tasks_mock,
        eml_rule,
        mail_account_handler,
        message_builder,
    ):
        """EML filename derived from subject arriving as NFD must be stored as NFC."""
        nfd_subject = unicodedata.normalize("NFD", "Rechnung März 2024")
        nfc_expected_filename = unicodedata.normalize("NFC", "Rechnung März 2024.eml")
        # Confirm the fixture is actually NFD
        assert unicodedata.is_normalized("NFD", nfd_subject)
        message = message_builder.create_message(
            subject=nfd_subject,
            from_="sender@example.com",
            attachments=0,
        )
        mail_account_handler._handle_message(message, eml_rule)
        queue_consumption_tasks_mock.assert_called_once()
        call_kwargs = queue_consumption_tasks_mock.call_args.kwargs
        consume_tasks = call_kwargs["consume_tasks"]
        assert len(consume_tasks) == 1
        overrides = consume_tasks[0].kwargs["overrides"]
        assert overrides.filename == nfc_expected_filename
        assert unicodedata.is_normalized("NFC", overrides.filename)

    def test_already_nfc_attachment_filename_unchanged(
        self,
        directories,
        queue_consumption_tasks_mock,
        attachment_rule,
        mail_account_handler,
        message_builder,
    ):
        """An attachment filename already in NFC must pass through unchanged."""
        nfc_filename = "Invoice_2024.pdf"
        assert unicodedata.is_normalized("NFC", nfc_filename)
        message = message_builder.create_message(
            subject="Invoice",
            from_="sender@example.com",
            attachments=[
                _AttachmentDef(filename=nfc_filename, content=b"%PDF-1.4 test"),
            ],
        )
        mail_account_handler._handle_message(message, attachment_rule)
        call_kwargs = queue_consumption_tasks_mock.call_args.kwargs
        consume_tasks = call_kwargs["consume_tasks"]
        overrides = consume_tasks[0].kwargs["overrides"]
        assert overrides.filename == nfc_filename
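
The property these tests rely on can be checked in isolation with the standard library alone. The sketch below is illustrative only: `to_nfc` is a hypothetical helper introduced for the example, not the function the mail handler actually calls, and the sample filename is written with an explicit escape so that its starting form is unambiguous.

```python
import unicodedata


def to_nfc(name: str) -> str:
    """Hypothetical helper: return `name` in Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", name)


# "\u00e4" is the precomposed "a with diaeresis", so this literal is NFC.
composed = "Rechnung M\u00e4rz.pdf"
# NFD splits it into "a" followed by U+0308 (combining diaeresis).
decomposed = unicodedata.normalize("NFD", composed)

# The two forms render identically but compare unequal as strings.
assert decomposed != composed
assert len(decomposed) == len(composed) + 1

# Normalizing the decomposed form recovers the composed one.
assert to_nfc(decomposed) == composed
assert unicodedata.is_normalized("NFC", to_nfc(decomposed))

# Pure-ASCII names are already NFC and pass through unchanged.
assert to_nfc("Invoice_2024.pdf") == "Invoice_2024.pdf"
```

Because the two forms differ at the code-point level, a filename stored without normalization can fail equality checks and path lookups against an otherwise identical NFC name, which is the failure mode the tests above guard against.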
Generated
+13 -106
@@ -2052,55 +2052,6 @@ redis = [
{ name = "redis", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
[[package]]
name = "lance-namespace"
version = "0.8.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "lance-namespace-urllib3-client", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/21/80/2b6eaa08c5e25915acaa6368a70211a25b5ba9d2d6006450e68a73936164/lance_namespace-0.8.0.tar.gz", hash = "sha256:c4a79ee221a3b2315c29863ad12d85fcf219a13158e26149d63e21dc4b4673a7", size = 10756, upload-time = "2026-06-01T08:47:10.183Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/4b/bd/7b40a08fb132fab39a6caebf832fdf6b9befc71be9413beb9be0a9d927d4/lance_namespace-0.8.0-py3-none-any.whl", hash = "sha256:782cf9e332f46bf06836722dd98b53ca8495ad98bb541501ff6876c89b67ec90", size = 12579, upload-time = "2026-06-01T08:47:10.91Z" },
]
[[package]]
name = "lance-namespace-urllib3-client"
version = "0.8.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "pydantic", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "python-dateutil", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "typing-extensions", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "urllib3", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
sdist = { url = "https://files.pythonhosted.org/packages/8c/37/06fcd5a8969381e0ba953d51990af8d331bdccbc62458bf2eed30d064573/lance_namespace_urllib3_client-0.8.0.tar.gz", hash = "sha256:4f060f05ebf3c04aeaeb0d2022cbe77648a3df290f02cd2c305e5797d0fc1fdd", size = 203710, upload-time = "2026-06-01T08:47:13.404Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/51/43/e280727feee958f303bc58d5fa912b07734a0831f756d841654d500c2c34/lance_namespace_urllib3_client-0.8.0-py3-none-any.whl", hash = "sha256:6734e341b726e5cc96a0cd257cef27eb9d03013f2d151526ee426cef8e63e228", size = 336669, upload-time = "2026-06-01T08:47:11.88Z" },
]
[[package]]
name = "lancedb"
version = "0.33.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "deprecation", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "lance-namespace", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "numpy", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "overrides", marker = "(python_full_version < '3.12' and sys_platform == 'darwin') or (python_full_version < '3.12' and sys_platform == 'linux')" },
{ name = "packaging", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "pyarrow", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "pydantic", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "tqdm", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
wheels = [
{ url = "https://files.pythonhosted.org/packages/09/2f/d5a4b2a5bb1f800936c76a6d8a4daf127a86fcab621eeb70b574a5adc774/lancedb-0.33.0-cp39-abi3-macosx_11_0_arm64.whl", hash = "sha256:d4eaf6fa7c2eac619208f1d396f4de635ee0f535673067118a31c1181575c48b", size = 48338115, upload-time = "2026-05-28T20:37:55.88Z" },
{ url = "https://files.pythonhosted.org/packages/07/12/31787b93a856b2c31382c7771dc22fb05575b70b87c9efe454269f4f0948/lancedb-0.33.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:6c6c2402ed2744245ae76c4167c0461da0a7a80f1608e0ec491c1548ea2b4302", size = 51162262, upload-time = "2026-05-28T20:37:59.101Z" },
{ url = "https://files.pythonhosted.org/packages/49/b7/081cc29f8e06bf12191b99ab3fe702aceebdb0914476b821a8c0445cacc8/lancedb-0.33.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl", hash = "sha256:7ebf1ffad811e6254a93931a79489ba1f21f48564bdfa06abae846f5fcaaf3e8", size = 54381368, upload-time = "2026-05-28T20:38:02.2Z" },
{ url = "https://files.pythonhosted.org/packages/1c/bd/e0f4bd621f10ecf96a801b0166e87799ed7ca5a9dbabcef9a6c766a58ef3/lancedb-0.33.0-cp39-abi3-manylinux_2_28_aarch64.whl", hash = "sha256:13da39f80adfea59e5831fe64e4166b2d70a2f843e6507bf644c4fe4c350087c", size = 51188986, upload-time = "2026-05-28T20:38:05.375Z" },
{ url = "https://files.pythonhosted.org/packages/d9/1a/a8647a432ac6aa59cdce1fc061a7050ea4278bcab364539b78af2ecf72d2/lancedb-0.33.0-cp39-abi3-manylinux_2_28_x86_64.whl", hash = "sha256:21b712825f0a00225e8974a41352c4ea84b0899ef8c23b17f672fadc38bd8346", size = 54440958, upload-time = "2026-05-28T20:38:08.474Z" },
]
[[package]]
name = "langdetect"
version = "1.0.9"
@@ -2892,15 +2843,6 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/1e/c1/d6e64ccd0536bf616556f0cad2b6d94a8125f508d25cfd814b1d2db4e2f1/openai-2.32.0-py3-none-any.whl", hash = "sha256:4dcc9badeb4bf54ad0d187453742f290226d30150890b7890711bda4f32f192f", size = 1162570, upload-time = "2026-04-15T22:28:17.714Z" },
]
[[package]]
name = "overrides"
version = "7.7.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/36/86/b585f53236dec60aba864e050778b25045f857e17f6e5ea0ae95fe80edd2/overrides-7.7.0.tar.gz", hash = "sha256:55158fa3d93b98cc75299b1e67078ad9003ca27945c76162c1c0766d6f91820a", size = 22812, upload-time = "2024-01-27T21:01:33.423Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/2c/ab/fc8290c6a4c722e5514d80f62b2dc4c4df1a68a41d1364e625c35990fcf3/overrides-7.7.0-py3-none-any.whl", hash = "sha256:c7ed9d062f78b8e4c1a7b70bd8796b35ead4d9f510227ef9c5dc7626c60d7e49", size = 17832, upload-time = "2024-01-27T21:01:31.393Z" },
]
[[package]]
name = "packaging"
version = "26.0"
@@ -2948,7 +2890,6 @@ dependencies = [
{ name = "ijson", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "imap-tools", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "jinja2", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "lancedb", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "langdetect", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "llama-index-core", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "llama-index-embeddings-huggingface", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2961,7 +2902,6 @@ dependencies = [
{ name = "openai", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "pathvalidate", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "pdf2image", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "pyarrow", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "python-dateutil", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "python-dotenv", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "python-gnupg", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
@@ -2973,6 +2913,7 @@ dependencies = [
{ name = "scikit-learn", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "sentence-transformers", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "setproctitle", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "sqlite-vec", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "tantivy", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "tika-client", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
{ name = "torch", version = "2.11.0", source = { registry = "https://download.pytorch.org/whl/cpu" }, marker = "sys_platform == 'darwin'" },
@@ -3099,7 +3040,6 @@ requires-dist = [
{ name = "ijson", specifier = ">=3.2" },
{ name = "imap-tools", specifier = "~=1.13.0" },
{ name = "jinja2", specifier = "~=3.1.5" },
{ name = "lancedb", specifier = "~=0.33.0" },
{ name = "langdetect", specifier = "~=1.0.9" },
{ name = "llama-index-core", specifier = ">=0.14.21" },
{ name = "llama-index-embeddings-huggingface", specifier = ">=0.6.1" },
@@ -3118,7 +3058,6 @@ requires-dist = [
{ name = "psycopg-c", marker = "python_full_version == '3.12.*' and platform_machine == 'x86_64' and sys_platform == 'linux' and extra == 'postgres'", url = "https://github.com/paperless-ngx/builder/releases/download/psycopg-trixie-3.3.0/psycopg_c-3.3.0-cp312-cp312-linux_x86_64.whl" },
{ name = "psycopg-c", marker = "(python_full_version != '3.12.*' and platform_machine == 'aarch64' and extra == 'postgres') or (python_full_version != '3.12.*' and platform_machine == 'x86_64' and extra == 'postgres') or (platform_machine != 'aarch64' and platform_machine != 'x86_64' and extra == 'postgres') or (sys_platform != 'linux' and extra == 'postgres')", specifier = "==3.3" },
{ name = "psycopg-pool", marker = "extra == 'postgres'", specifier = "==3.3" },
{ name = "pyarrow", specifier = ">=16" },
{ name = "python-dateutil", specifier = "~=2.9.0" },
{ name = "python-dotenv", specifier = "~=1.2.1" },
{ name = "python-gnupg", specifier = "~=0.5.4" },
@@ -3130,6 +3069,7 @@ requires-dist = [
{ name = "scikit-learn", specifier = "~=1.8.0" },
{ name = "sentence-transformers", specifier = ">=5.4.1" },
{ name = "setproctitle", specifier = "~=1.3.4" },
{ name = "sqlite-vec", specifier = "==0.1.9" },
{ name = "tantivy", specifier = "~=0.26.0" },
{ name = "tika-client", specifier = "~=0.11.0" },
{ name = "torch", specifier = "~=2.11.0", index = "https://download.pytorch.org/whl/cpu" },
@@ -3617,50 +3557,6 @@ version = "0.16.1"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/1d/c7/28220d37e041fe1df03e857fe48f768dcd30cd151480bf6f00da8713214a/py-ubjson-0.16.1.tar.gz", hash = "sha256:b9bfb8695a1c7e3632e800fb83c943bf67ed45ddd87cd0344851610c69a5a482", size = 50316, upload-time = "2020-04-18T15:05:57.698Z" }
[[package]]
name = "pyarrow"
version = "24.0.0"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/91/13/13e1069b351bdc3881266e11147ffccf687505dbb0ea74036237f5d454a5/pyarrow-24.0.0.tar.gz", hash = "sha256:85fe721a14dd823aca09127acbb06c3ca723efbd436c004f16bca601b04dcc83", size = 1180261, upload-time = "2026-04-21T10:51:25.837Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/62/c9/a47ab7ece0d86cbe6678418a0fbd1ac4bb493b9184a3891dfa0e7f287ae0/pyarrow-24.0.0-cp311-cp311-macosx_12_0_arm64.whl", hash = "sha256:b0e131f880cda8d04e076cee175a46fc0e8bc8b65c99c6c09dff6669335fde74", size = 35068898, upload-time = "2026-04-21T10:46:36.599Z" },
{ url = "https://files.pythonhosted.org/packages/d1/bc/8db86617a9a58008acf8913d6fed68ea2a46acb6de928db28d724c891a68/pyarrow-24.0.0-cp311-cp311-macosx_12_0_x86_64.whl", hash = "sha256:1b2fe7f9a5566401a0ef2571f197eb92358925c1f0c8dba305d6e43ea0871bb3", size = 36679915, upload-time = "2026-04-21T10:46:42.602Z" },
{ url = "https://files.pythonhosted.org/packages/eb/8e/fb178720400ef69db251eb4a9c3ccf4af269bc1feb5055529b8fc87170d1/pyarrow-24.0.0-cp311-cp311-manylinux_2_28_aarch64.whl", hash = "sha256:0b3537c00fb8d384f15ac1e79b6eb6db04a16514c8c1d22e59a9b95c8ba42868", size = 45697931, upload-time = "2026-04-21T10:46:48.403Z" },
{ url = "https://files.pythonhosted.org/packages/f3/27/99c42abe8e21b44f4917f62631f3aa31404882a2c41d8a4cd5c110e13d52/pyarrow-24.0.0-cp311-cp311-manylinux_2_28_x86_64.whl", hash = "sha256:14e31a3c9e35f1ab6356c6378f6f72830e6d2d5f1791df3774a7b097d18a6a1e", size = 48837449, upload-time = "2026-04-21T10:46:55.329Z" },
{ url = "https://files.pythonhosted.org/packages/36/b6/333749e2666e9032891125bf9c691146e92901bece62030ac1430e2e7c88/pyarrow-24.0.0-cp311-cp311-musllinux_1_2_aarch64.whl", hash = "sha256:b7d9a514e73bc42711e6a35aaccf3587c520024fe0a25d830a1a8a27c15f4f57", size = 49395949, upload-time = "2026-04-21T10:47:01.869Z" },
{ url = "https://files.pythonhosted.org/packages/17/25/c5201706a2dd374e8ba6ee3fd7a8c89fb7ffc16eed5217a91fd2bd7f7626/pyarrow-24.0.0-cp311-cp311-musllinux_1_2_x86_64.whl", hash = "sha256:b196eb3f931862af3fa84c2a253514d859c08e0d8fe020e07be12e75a5a9780c", size = 51912986, upload-time = "2026-04-21T10:47:09.872Z" },
{ url = "https://files.pythonhosted.org/packages/b4/a9/9686d9f07837f91f775e8932659192e02c74f9d8920524b480b85212cc68/pyarrow-24.0.0-cp312-cp312-macosx_12_0_arm64.whl", hash = "sha256:6233c9ed9ab9d1db47de57d9753256d9dcffbf42db341576099f0fd9f6bf4810", size = 34981559, upload-time = "2026-04-21T10:47:22.17Z" },
{ url = "https://files.pythonhosted.org/packages/80/b6/0ddf0e9b6ead3474ab087ae598c76b031fc45532bf6a63f3a553440fb258/pyarrow-24.0.0-cp312-cp312-macosx_12_0_x86_64.whl", hash = "sha256:f7616236ec1bc2b15bfdec22a71ab38851c86f8f05ff64f379e1278cf20c634a", size = 36663654, upload-time = "2026-04-21T10:47:28.315Z" },
{ url = "https://files.pythonhosted.org/packages/7c/3b/926382efe8ce27ba729071d3566ade6dfb86bdf112f366000196b2f5780a/pyarrow-24.0.0-cp312-cp312-manylinux_2_28_aarch64.whl", hash = "sha256:1617043b99bd33e5318ae18eb2919af09c71322ef1ca46566cdafc6e6712fb66", size = 45679394, upload-time = "2026-04-21T10:47:34.821Z" },
{ url = "https://files.pythonhosted.org/packages/b3/7a/829f7d9dfd37c207206081d6dad474d81dde29952401f07f2ba507814818/pyarrow-24.0.0-cp312-cp312-manylinux_2_28_x86_64.whl", hash = "sha256:6165461f55ef6314f026de6638d661188e3455d3ec49834556a0ebbdbace18bb", size = 48863122, upload-time = "2026-04-21T10:47:42.056Z" },
{ url = "https://files.pythonhosted.org/packages/5f/e8/f88ce625fe8babaae64e8db2d417c7653adb3019b08aae85c5ed787dc816/pyarrow-24.0.0-cp312-cp312-musllinux_1_2_aarch64.whl", hash = "sha256:3b13dedfe76a0ad2d1d859b0811b53827a4e9d93a0bcb05cf59333ab4980cc7e", size = 49376032, upload-time = "2026-04-21T10:47:48.967Z" },
{ url = "https://files.pythonhosted.org/packages/36/7a/82c363caa145fff88fb475da50d3bf52bb024f61917be5424c3392eaf878/pyarrow-24.0.0-cp312-cp312-musllinux_1_2_x86_64.whl", hash = "sha256:25ea65d868eb04015cd18e6df2fbe98f07e5bda2abefabcb88fce39a947716f6", size = 51929490, upload-time = "2026-04-21T10:47:55.981Z" },
{ url = "https://files.pythonhosted.org/packages/6f/d3/a1abf004482026ddc17f4503db227787fa3cfe41ec5091ff20e4fea55e57/pyarrow-24.0.0-cp313-cp313-macosx_12_0_arm64.whl", hash = "sha256:02b001b3ed4723caa44f6cd1af2d5c86aa2cf9971dacc2ffa55b21237713dfba", size = 34976759, upload-time = "2026-04-21T10:48:07.258Z" },
{ url = "https://files.pythonhosted.org/packages/4f/4a/34f0a36d28a2dd32225301b79daad44e243dc1a2bb77d43b60749be255c4/pyarrow-24.0.0-cp313-cp313-macosx_12_0_x86_64.whl", hash = "sha256:04920d6a71aabd08a0417709efce97d45ea8e6fb733d9ca9ecffb13c67839f68", size = 36658471, upload-time = "2026-04-21T10:48:13.347Z" },
{ url = "https://files.pythonhosted.org/packages/1f/78/543b94712ae8bb1a6023bcc1acf1a740fbff8286747c289cd9468fced2a5/pyarrow-24.0.0-cp313-cp313-manylinux_2_28_aarch64.whl", hash = "sha256:a964266397740257f16f7bb2e4f08a0c81454004beab8ff59dd531b73610e9f2", size = 45675981, upload-time = "2026-04-21T10:48:20.201Z" },
{ url = "https://files.pythonhosted.org/packages/84/9f/8fb7c222b100d314137fa40ec050de56cd8c6d957d1cfff685ce72f15b17/pyarrow-24.0.0-cp313-cp313-manylinux_2_28_x86_64.whl", hash = "sha256:6f066b179d68c413374294bc1735f68475457c933258df594443bb9d88ddc2a0", size = 48859172, upload-time = "2026-04-21T10:48:27.541Z" },
{ url = "https://files.pythonhosted.org/packages/a7/d3/1ea72538e6c8b3b475ed78d1049a2c518e655761ea50fe1171fc855fcab7/pyarrow-24.0.0-cp313-cp313-musllinux_1_2_aarch64.whl", hash = "sha256:1183baeb14c5f587b1ec52831e665718ce632caab84b7cd6b85fd44f96114495", size = 49385733, upload-time = "2026-04-21T10:48:34.7Z" },
{ url = "https://files.pythonhosted.org/packages/c3/be/c3d8b06a1ba35f2260f8e1f771abbee7d5e345c0937aab90675706b1690a/pyarrow-24.0.0-cp313-cp313-musllinux_1_2_x86_64.whl", hash = "sha256:806f24b4085453c197a5078218d1ee08783ebbba271badd153d1ae22a3ee804f", size = 51934335, upload-time = "2026-04-21T10:48:42.099Z" },
{ url = "https://files.pythonhosted.org/packages/17/1a/cff3a59f80b5b1658549d46611b67163f65e0664431c076ad728bf9d5af4/pyarrow-24.0.0-cp313-cp313t-macosx_12_0_arm64.whl", hash = "sha256:1a4e45017efbf115032e4475ee876d525e0e36c742214fbe405332480ecd6275", size = 35238554, upload-time = "2026-04-21T10:48:48.526Z" },
{ url = "https://files.pythonhosted.org/packages/a8/99/cce0f42a327bfef2c420fb6078a3eb834826e5d6697bf3009fe11d2ad051/pyarrow-24.0.0-cp313-cp313t-macosx_12_0_x86_64.whl", hash = "sha256:7986f1fa71cee060ad00758bcc79d3a93bab8559bf978fab9e53472a2e25a17b", size = 36782301, upload-time = "2026-04-21T10:48:55.181Z" },
{ url = "https://files.pythonhosted.org/packages/2a/66/8e560d5ff6793ca29aca213c53eec0dd482dd46cb93b2819e5aab52e4252/pyarrow-24.0.0-cp313-cp313t-manylinux_2_28_aarch64.whl", hash = "sha256:d3e0b61e8efb24ed38898e5cdc5fffa9124be480008d401a1f8071500494ae42", size = 45721929, upload-time = "2026-04-21T10:49:03.676Z" },
{ url = "https://files.pythonhosted.org/packages/27/0c/a26e25505d030716e078d9f16eb74973cbf0b33b672884e9f9da1c83b871/pyarrow-24.0.0-cp313-cp313t-manylinux_2_28_x86_64.whl", hash = "sha256:55a3bc1e3df3b5567b7d27ef551b2283f0c68a5e86f1cd56abc569da4f31335b", size = 48825365, upload-time = "2026-04-21T10:49:11.714Z" },
{ url = "https://files.pythonhosted.org/packages/5f/eb/771f9ecb0c65e73fe9dccdd1717901b9594f08c4515d000c7c62df573811/pyarrow-24.0.0-cp313-cp313t-musllinux_1_2_aarch64.whl", hash = "sha256:641f795b361874ac9da5294f8f443dfdbee355cf2bd9e3b8d97aaac2306b9b37", size = 49451819, upload-time = "2026-04-21T10:49:21.474Z" },
{ url = "https://files.pythonhosted.org/packages/48/da/61ae89a88732f5a785646f3ec6125dbb640fa98a540eb2b9889caa561403/pyarrow-24.0.0-cp313-cp313t-musllinux_1_2_x86_64.whl", hash = "sha256:8adc8e6ce5fccf5dc707046ae4914fd537def529709cc0d285d37a7f9cd442ca", size = 51909252, upload-time = "2026-04-21T10:49:31.164Z" },
{ url = "https://files.pythonhosted.org/packages/ad/80/d022a34ff05d2cbedd8ccf841fc1f532ecfa9eb5ed1711b56d0e0ea71fc9/pyarrow-24.0.0-cp314-cp314-macosx_12_0_arm64.whl", hash = "sha256:1cc9057f0319e26333b357e17f3c2c022f1a83739b48a88b25bfd5fa2dc18838", size = 35007997, upload-time = "2026-04-21T10:49:48.796Z" },
{ url = "https://files.pythonhosted.org/packages/1a/ff/f01485fda6f4e5d441afb8dd5e7681e4db18826c1e271852f5d3957d6a80/pyarrow-24.0.0-cp314-cp314-macosx_12_0_x86_64.whl", hash = "sha256:e6f1278ee4785b6db21229374a1c9e54ec7c549de5d1efc9630b6207de7e170b", size = 36678720, upload-time = "2026-04-21T10:49:55.858Z" },
{ url = "https://files.pythonhosted.org/packages/9e/c2/2d2d5fea814237923f71b36495211f20b43a1576f9a4d6da7e751a64ec6f/pyarrow-24.0.0-cp314-cp314-manylinux_2_28_aarch64.whl", hash = "sha256:adbbedc55506cbdabb830890444fb856bfb0060c46c6f8026c6c2f2cf86ae795", size = 45741852, upload-time = "2026-04-21T10:50:04.624Z" },
{ url = "https://files.pythonhosted.org/packages/8e/3a/28ba9c1c1ebdbb5f1b94dfebb46f207e52e6a554b7fe4132540fde29a3a0/pyarrow-24.0.0-cp314-cp314-manylinux_2_28_x86_64.whl", hash = "sha256:ae8a1145af31d903fa9bb166824d7abe9b4681a000b0159c9fb99c11bc11ad26", size = 48889852, upload-time = "2026-04-21T10:50:12.293Z" },
{ url = "https://files.pythonhosted.org/packages/df/51/4a389acfd31dca009f8fb82d7f510bb4130f2b3a8e18cf00194d0687d8ac/pyarrow-24.0.0-cp314-cp314-musllinux_1_2_aarch64.whl", hash = "sha256:d7027eba1df3b2069e2e8d80f644fa0918b68c46432af3d088ddd390d063ecde", size = 49445207, upload-time = "2026-04-21T10:50:20.677Z" },
{ url = "https://files.pythonhosted.org/packages/19/4b/0bab2b23d2ae901b1b9a03c0efd4b2d070256f8ce3fc43f6e58c167b2081/pyarrow-24.0.0-cp314-cp314-musllinux_1_2_x86_64.whl", hash = "sha256:e56a1ffe9bf7b727432b89104cc0849c21582949dd7bdcb34f17b2001a351a76", size = 51954117, upload-time = "2026-04-21T10:50:29.14Z" },
{ url = "https://files.pythonhosted.org/packages/79/4f/46a49a63f43526da895b1a45bbb51d5baf8e4d77159f8528fc3e5490007f/pyarrow-24.0.0-cp314-cp314t-macosx_12_0_arm64.whl", hash = "sha256:418e48ce50a45a6a6c73c454677203a9c75c966cb1e92ca3370959185f197a05", size = 35250387, upload-time = "2026-04-21T10:50:35.552Z" },
{ url = "https://files.pythonhosted.org/packages/a0/da/d5e0cd5ef00796922404806d5f00325cdadc3441ce2c13fe7115f2df9a64/pyarrow-24.0.0-cp314-cp314t-macosx_12_0_x86_64.whl", hash = "sha256:2f16197705a230a78270cdd4ea8a1d57e86b2fdcbc34a1f6aebc72e65c986f9a", size = 36797102, upload-time = "2026-04-21T10:50:42.417Z" },
{ url = "https://files.pythonhosted.org/packages/34/c7/5904145b0a593a05236c882933d439b5720f0a145381179063722fbfc123/pyarrow-24.0.0-cp314-cp314t-manylinux_2_28_aarch64.whl", hash = "sha256:fb24ac194bfc5e86839d7dcd52092ee31e5fe6733fe11f5e3b06ef0812b20072", size = 45745118, upload-time = "2026-04-21T10:50:49.324Z" },
{ url = "https://files.pythonhosted.org/packages/13/d3/cca42fe166d1c6e4d5b80e530b7949104d10e17508a90ae202dac205ce2a/pyarrow-24.0.0-cp314-cp314t-manylinux_2_28_x86_64.whl", hash = "sha256:9700ebd9a51f5895ce75ff4ac4b3c47a7d4b42bc618be8e713e5d56bacf5f931", size = 48844765, upload-time = "2026-04-21T10:50:55.579Z" },
{ url = "https://files.pythonhosted.org/packages/b0/49/942c3b79878ba928324d1e17c274ed84581db8c0a749b24bcf4cbdf15bd3/pyarrow-24.0.0-cp314-cp314t-musllinux_1_2_aarch64.whl", hash = "sha256:d8ddd2768da81d3ee08cfea9b597f4abb4e8e1dc8ae7e204b608d23a0d3ab699", size = 49471890, upload-time = "2026-04-21T10:51:02.439Z" },
{ url = "https://files.pythonhosted.org/packages/76/97/ff71431000a75d84135a1ace5ca4ba11726a231a8007bbb320a4c54075d5/pyarrow-24.0.0-cp314-cp314t-musllinux_1_2_x86_64.whl", hash = "sha256:61a3d7eaa97a14768b542f3d284dc6400dd2470d9f080708b13cd46b6ae18136", size = 51932250, upload-time = "2026-04-21T10:51:10.576Z" },
]
[[package]]
name = "pyasn1"
version = "0.6.3"
@@ -4772,6 +4668,17 @@ asyncio = [
{ name = "greenlet", marker = "sys_platform == 'darwin' or sys_platform == 'linux'" },
]
[[package]]
name = "sqlite-vec"
version = "0.1.9"
source = { registry = "https://pypi.org/simple" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/68/85/9fad0045d8e7c8df3e0fa5a56c630e8e15ad6e5ca2e6106fceb666aa6638/sqlite_vec-0.1.9-py3-none-macosx_10_6_x86_64.whl", hash = "sha256:1b62a7f0a060d9475575d4e599bbf94a13d85af896bc1ce86ee80d1b5b48e5fb", size = 131171, upload-time = "2026-03-31T08:02:31.717Z" },
{ url = "https://files.pythonhosted.org/packages/a4/3d/3677e0cd2f92e5ebc43cd29fbf565b75582bff1ccfa0b8327c7508e1084f/sqlite_vec-0.1.9-py3-none-macosx_11_0_arm64.whl", hash = "sha256:1d52e30513bae4cc9778ddbf6145610434081be4c3afe57cd877893bad9f6b6c", size = 165434, upload-time = "2026-03-31T08:02:32.712Z" },
{ url = "https://files.pythonhosted.org/packages/00/d4/f2b936d3bdc38eadcbd2a87875815db36430fab0363182ba5d12cd8e0b51/sqlite_vec-0.1.9-py3-none-manylinux_2_17_aarch64.manylinux2014_aarch64.whl", hash = "sha256:4e921e592f24a5f9a18f590b6ddd530eb637e2d474e3b1972f9bbeb773aa3cb9", size = 160076, upload-time = "2026-03-31T08:02:33.796Z" },
{ url = "https://files.pythonhosted.org/packages/6f/ad/6afd073b0f817b3e03f9e37ad626ae341805891f23c74b5292818f49ac63/sqlite_vec-0.1.9-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux1_x86_64.whl", hash = "sha256:1515727990b49e79bcaf75fdee2ffc7d461f8b66905013231251f1c8938e7786", size = 163388, upload-time = "2026-03-31T08:02:34.888Z" },
]
[[package]]
name = "sqlparse"
version = "0.5.5"