Commit Graph

11562 Commits

Author SHA1 Message Date
stumpylog 75a31ee09b Extracts some common code into helpers instead of duplicating it 2026-06-05 13:31:32 -07:00
stumpylog a23888aa1b Simplify table existence checking and ensuring a table exists 2026-06-05 13:15:50 -07:00
stumpylog e3ebe7cda1 Abstracts the modified check into the vector store 2026-06-05 13:06:44 -07:00
stumpylog 8533b99adf Don't need a dead parameter 2026-06-05 13:02:45 -07:00
stumpylog b54b8a23ce Don't always re-create the document_id index, do it only if not already existing 2026-06-05 12:58:22 -07:00
stumpylog a5f7a5561d ensure the llm_dir exists for the write_store too 2026-06-05 12:49:41 -07:00
stumpylog 3c2ef25edd Fixes this test so it works regardless of cwd 2026-06-05 11:56:49 -07:00
stumpylog 09b3063344 Small targeted tests for coverage or pragma no cover 2026-06-05 11:53:08 -07:00
stumpylog 60faa3f20f Removes the spec and plan files 2026-06-05 11:43:42 -07:00
stumpylog ca6dca0efe Adds a new compact sub-command + handler to force compact lancedb version 2026-06-05 11:43:42 -07:00
stumpylog 3aa83c9e4c To reduce embedding size, don't store the metadata in the body. Body is content + a few other things, metadata keys hold the metadata 2026-06-05 11:43:42 -07:00
stumpylog e7f8bf0542 Globally reduces httpx logging 2026-06-05 11:43:42 -07:00
stumpylog 707c3d7842 fix(ai): sort document_id filter values; add chat filter scoping test
- chat.py: use sorted() for doc_ids in the MetadataFilters IN clause,
  matching the same pattern used in query_similar_documents. Ensures
  deterministic filter construction regardless of document iteration order.
- test_chat.py: add test_chat_filter_contains_only_requested_document_ids
  verifying that the retriever receives a filter scoped only to the
  requested documents (not all indexed documents). Inspired by
  test_document_filtered_retriever_applies_lancedb_metadata_filter in
  origin/feature/beta-lancedb.

Co-Authored-By: shamoon <shamoon@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
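
The deterministic filter construction described in 707c3d7842 can be illustrated with a minimal sketch. It does not use the project's MetadataFilters class: `build_document_id_filter` and the dict layout are hypothetical stand-ins, chosen only to show why sorting the ids makes the resulting filter independent of iteration order.

```python
from collections.abc import Iterable


def build_document_id_filter(doc_ids: Iterable[int]) -> dict:
    """Build an IN-style filter whose values are always in sorted order.

    Hypothetical helper: the dict layout only imitates a metadata filter
    and is not the project's actual filter type.
    """
    # sorted() removes any dependence on the iteration order of the input
    # (for example a set, or rows returned in arbitrary order).
    return {"key": "document_id", "operator": "in", "value": sorted(doc_ids)}


# The same ids in different containers and orders yield an identical filter.
first = build_document_id_filter({3, 1, 2})
second = build_document_id_filter([2, 3, 1])
```

Because both calls produce the same value list, a test can compare the constructed filter against a fixed expectation without caring how the documents were iterated.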
stumpylog eab0a4abea fix(ai): rename vector_store_file_exists -> llm_index_exists in views.py
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog a9d339157f refactor(ai): write_store() context manager wraps the FileLock
All mutating index operations (upsert, delete, rebuild, compact) now use
  with write_store() as store:
instead of explicit FileLock + get_vector_store() at each call site.
Read paths continue to use get_vector_store() directly (no lock needed).
Also type-annotates test fixture params throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 2c3c892dae test(ai): type-annotate fixture parameters
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 17755a2c58 refactor(ai): cleanup pass — naming, batched embedding, remove dead wrappers
- Rename vector_store_file_exists -> llm_index_exists (accurate now)
- Rename _iter_existing_modified -> _stored_modified_times; project away
  vector column (cheap scan) and return dict[doc_id, modified_str] directly
- Drop _index_lock_path() indirection; inline settings.LLM_INDEX_LOCK
- Move LLM_INDEX_LOCK alongside the index dir (drop_table is safe; no rmtree)
- Drop current_embedding_dim() redirect; callers use get_embedding_dim()
- Drop lazy-import explanatory comments (constraint lives in CLAUDE.md)
- Batch embedding calls via get_text_embedding_batch() in all three loops
- get_nodes: raise NotImplementedError for node_ids (was silently ignored)
- has_nodes(): cheap limit(1) existence check; chat.py uses it instead of
  get_nodes() which materialized all matching rows
- conftest: use mocker fixture (pytest-mock) instead of bare patch; add
  LLM_INDEX_LOCK to temp_llm_index_dir override; type-annotate mock_embed_model

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 6775fed68e types(ai): pass pyrefly for the LanceDB vector store code
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 2673bc7b9c refactor(ai): drop unused delete_nodes and node_ids path from the adapter
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 97fb1ec313 test(ai): drop FAISS-internal assertions
Remove tests that validated removed internals (get_or_create_storage_context,
remove_document_docstore_nodes, index.docstore.docs) and rewrite the remaining
ones to assert against the LanceDB store directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog e5acadff52 refactor(ai): chat uses a stock filtered retriever
Delete _get_document_filtered_retriever (74-line custom FAISS retriever
with expanding top_k loop) and rewrite _stream_chat_with_documents to use
a stock VectorIndexRetriever with MetadataFilters(IN). The no-content
pre-check now calls index.vector_store.get_nodes(filters=...) which
returns [] cleanly for un-indexed documents. Move FakeEmbedding and
mock_embed_model fixture to conftest.py so both test_chat.py and
test_ai_indexing.py share them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 6e69dc78a7 feat(ai): dimension guard and FAISS index migration
Drops migrate_stale_faiss_index (users delete llm_index/ manually on upgrade).

Keeps embedding_dim_mismatch to force a rebuild when the model changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
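
The dimension guard kept by 6e69dc78a7 can be pictured with the following sketch. The commit message does not show the signature of embedding_dim_mismatch, so `needs_rebuild` and its parameters are hypothetical; the sketch only assumes that the guard compares the dimension of the stored vectors with the dimension the current embedding model produces.

```python
from typing import Optional


def needs_rebuild(stored_dim: Optional[int], model_dim: int) -> bool:
    """Return True when an existing index was built with another dimension.

    Hypothetical helper: stored_dim is None when no index exists yet.
    """
    # With no stored index there is nothing to mismatch against; a normal
    # first build handles that case.
    if stored_dim is None:
        return False
    # Vectors of a different length cannot be compared with new embeddings,
    # so the whole index has to be rebuilt.
    return stored_dim != model_dim
```

For example, an index built with 384-dimensional vectors would be flagged for a rebuild after switching to a model that emits 768-dimensional vectors, while an unchanged model leaves the index alone.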
stumpylog b855eba878 feat(ai): dimension guard and FAISS index migration
Adds current_embedding_dim() to embedding.py, migrate_stale_faiss_index()
and embedding_dim_mismatch() to indexing.py, and wires both into
update_llm_index so that stale FAISS directories are wiped on startup and
embedding model changes force a full index rebuild.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 1f2af9087c refactor(ai): query_similar_documents via metadata filter
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog 6bb8212f20 refactor(ai): group new LanceDB indexing tests in a class
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:42 -07:00
stumpylog f41e32cfcd refactor(ai): build the index from the LanceDB store alone (lazy import)
Replace get_or_create_storage_context with get_vector_store() (lazy import
of paperless_ai.vector_store inside the function), rewrite load_or_build_index
to use VectorStoreIndex.from_vector_store, and rewrite vector_store_file_exists
to use store.table_exists(). Add LLM_INDEX_TABLE constant and TYPE_CHECKING-only
import of PaperlessLanceVectorStore. Delete remove_document_docstore_nodes and
rewire llm_index_add_or_update_document, llm_index_remove_document, and
update_llm_index to use upsert_document/delete/drop_table on the LanceDB store.
Serialize tags list as JSON string to satisfy flat_metadata validation. Add
test_get_vector_store_roundtrip, test_add_then_remove_document,
test_update_shrinks_chunks_without_orphans, and the subprocess lazy-import guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 2f5d199fef feat(ai): tie LlamaDocument id to the paperless document id
Set id_=str(document.id) on the LlamaDocument constructor in
build_document_node so that every chunk node's ref_doc_id equals the
paperless document pk, enabling the LanceDB adapter's delete(str(doc.id))
and doc_id column to work correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 848f140c04 refactor(ai): drop version-defensive vector-index check (lancedb is pinned)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 4219622e1b refactor(ai): log when the vector-index check fails
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 52b77413f9 feat(ai): ANN index threshold, scalar index, and compaction
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 849c0fbe23 fix(ai): upsert empty-nodes path deletes by document_id
When upsert_document receives an empty nodes list, delete existing
chunks using the document_id column directly (consistent with the
merge_insert prune predicate) rather than calling delete() which
filters on doc_id. Guard for a missing table to avoid a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog f9e5480c64 feat(ai): atomic upsert_document on the LanceDB store
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 9367cf531e docs(plan): add Task 13 — pass new AI code through pyrefly
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 5dac7c897f refactor(ai): address review on the LanceDB adapter
- Fix delete() to use single-quote delimiter consistent with _escape
- Fix _distance comment: L2 not squared-L2
- Fix similarity_top_k zero-guard to use explicit None check
- Replace deprecated table_names() with list_tables().tables (lancedb 0.33)
- Add add() Sequence[BaseNode] signature with collections.abc.Sequence import
- Add test_build_where_or_condition for OR filter branch coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 0bab965d9c feat(ai): add LanceDB-backed vector store adapter
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog a7ea06c820 build: replace faiss-cpu with lancedb for the AI vector store
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog 714d3d68c5 Design: Implementation plan for the LanceDB vector store
Task-by-task TDD plan implementing the LanceDB design spec: dependency
swap, the PaperlessLanceVectorStore adapter, atomic merge_insert upsert,
ANN threshold + scalar index + compaction, the indexing/chat/similar
rewires, FAISS migration, and a lazy-import guard test so non-AI paths
(management commands) never drag in llama_index/lancedb/pyarrow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
stumpylog f0d233631a Design: Replace FAISS vector store with LanceDB (custom adapter)
Spec for swapping the AI feature's llama-index FAISS StorageContext trio
(FaissVectorStore + SimpleDocumentStore + SimpleIndexStore) for LanceDB via
a custom BasePydanticVectorStore adapter (no llama-index-vector-stores-lancedb,
no pandas).

Covers: disk-resident memory-mapped storage, native merge_insert upsert with
when_not_matched_by_source_delete, MetadataFilters(IN) filtering on a top-level
document_id column, auto IVF ANN threshold (IVF_FLAT fallback), MVCC compaction
via optimize(cleanup_older_than=...), migration, concurrency, and testing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 11:43:41 -07:00
shamoon 449fd97b1f Fix (beta): respect disable state for suggest endpoint, require change perms (#12942) 2026-06-05 14:16:53 +00:00
Trenton H fa0c4368d7 Fix: Ensure checksum comparison is using SHA256 in file handling (#12939) 2026-06-05 06:46:45 -07:00
shamoon 289d797837 Merge branch 'dev' into beta 2026-06-03 15:12:44 -07:00
dependabot[bot] f3eb8d4f58 docker-compose(deps): bump apache/tika in /docker/compose (#12912)
Bumps apache/tika from 3.2.3.0 to 3.3.1.0.

---
updated-dependencies:
- dependency-name: apache/tika
  dependency-version: 3.3.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 13:13:14 -07:00
dependabot[bot] eab964124d docker-compose(deps): bump gotenberg/gotenberg in /docker/compose (#12910)
Bumps gotenberg/gotenberg from 8.27 to 8.33.

---
updated-dependencies:
- dependency-name: gotenberg/gotenberg
  dependency-version: '8.33'
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 12:40:18 -07:00
Trenton H 7ef6ba69e6 Fix: Validate the AI backend settings earlier instead of crashing inside the AI module (#12903) 2026-06-03 12:16:09 -07:00
dependabot[bot] 2e9b07b77f docker-compose(deps): Bump nginx in /docker/compose (#12911)
Bumps nginx from 1.29.5-alpine to 1.31.1-alpine.

---
updated-dependencies:
- dependency-name: nginx
  dependency-version: 1.31.1-alpine
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 11:41:13 -07:00
Trenton H abdcdccf08 Chore(deps): Silence a couple more vulnerabilities here (#12797) 2026-06-03 09:28:00 -07:00
shamoon 1663ed170c Enhancement (beta): add direct LLM language setting (#12906) 2026-06-03 15:53:22 +00:00
dependabot[bot] 59f22a3d59 Chore(deps-dev): Bump @playwright/test from 1.59.1 to 1.60.0 in /src-ui (#12919)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
Signed-off-by: dependabot[bot] <support@github.com>
2026-06-03 15:49:50 +00:00
shamoon 47a6fcfc39 Fix (beta): correctly apply i18n in suggestions dropdown (#12905) 2026-06-03 08:40:06 -07:00
dependabot[bot] edcc78d450 Chore(deps-dev): Bump @types/node from 25.6.0 to 25.9.1 in /src-ui (#12915)
Bumps [@types/node](https://github.com/DefinitelyTyped/DefinitelyTyped/tree/HEAD/types/node) from 25.6.0 to 25.9.1.
- [Release notes](https://github.com/DefinitelyTyped/DefinitelyTyped/releases)
- [Commits](https://github.com/DefinitelyTyped/DefinitelyTyped/commits/HEAD/types/node)

---
updated-dependencies:
- dependency-name: "@types/node"
  dependency-version: 25.9.1
  dependency-type: direct:development
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2026-06-03 15:26:15 +00:00