Commit Graph

11515 Commits

Author SHA1 Message Date
stumpylog 788ae5d4e5 refactor(ai): chat uses a stock filtered retriever
Delete _get_document_filtered_retriever (74-line custom FAISS retriever
with expanding top_k loop) and rewrite _stream_chat_with_documents to use
a stock VectorIndexRetriever with MetadataFilters(IN). The no-content
pre-check now calls index.vector_store.get_nodes(filters=...) which
returns [] cleanly for un-indexed documents. Move FakeEmbedding and
mock_embed_model fixture to conftest.py so both test_chat.py and
test_ai_indexing.py share them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:08 -07:00
stumpylog d0a7c47f92 feat(ai): dimension guard and FAISS index migration
Drops migrate_stale_faiss_index (users delete llm_index/ manually on upgrade).

Keeps embedding_dim_mismatch to force a rebuild when the model changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:08 -07:00
stumpylog d70b08296c feat(ai): dimension guard and FAISS index migration
Adds current_embedding_dim() to embedding.py, migrate_stale_faiss_index()
and embedding_dim_mismatch() to indexing.py, and wires both into
update_llm_index so that stale FAISS directories are wiped on startup and
embedding model changes force a full index rebuild.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:07 -07:00
stumpylog d9b2c4fa86 refactor(ai): query_similar_documents via metadata filter
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:07 -07:00
stumpylog d641925c4d refactor(ai): group new LanceDB indexing tests in a class
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:07 -07:00
stumpylog e9c3f04b8b refactor(ai): build the index from the LanceDB store alone (lazy import)
Replace get_or_create_storage_context with get_vector_store() (lazy import
of paperless_ai.vector_store inside the function), rewrite load_or_build_index
to use VectorStoreIndex.from_vector_store, and rewrite vector_store_file_exists
to use store.table_exists(). Add LLM_INDEX_TABLE constant and TYPE_CHECKING-only
import of PaperlessLanceVectorStore. Delete remove_document_docstore_nodes and
rewire llm_index_add_or_update_document, llm_index_remove_document, and
update_llm_index to use upsert_document/delete/drop_table on the LanceDB store.
Serialize tags list as JSON string to satisfy flat_metadata validation. Add
test_get_vector_store_roundtrip, test_add_then_remove_document,
test_update_shrinks_chunks_without_orphans, and the subprocess lazy-import guard.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:06 -07:00
stumpylog 9a40b4ac9d feat(ai): tie LlamaDocument id to the paperless document id
Set id_=str(document.id) on the LlamaDocument constructor in
build_document_node so that every chunk node's ref_doc_id equals the
paperless document pk, enabling the LanceDB adapter's delete(str(doc.id))
and doc_id column to work correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:06 -07:00
stumpylog b2e0dbef46 refactor(ai): drop version-defensive vector-index check (lancedb is pinned)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:06 -07:00
stumpylog 98a5a583f3 refactor(ai): log when the vector-index check fails
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:06 -07:00
stumpylog 0421bfcf54 feat(ai): ANN index threshold, scalar index, and compaction
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:06 -07:00
stumpylog fa74bb77b3 fix(ai): upsert empty-nodes path deletes by document_id
When upsert_document receives an empty nodes list, delete existing
chunks using the document_id column directly (consistent with the
merge_insert prune predicate) rather than calling delete() which
filters on doc_id. Guard for a missing table to avoid a no-op.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:05 -07:00
stumpylog f0311e77d4 feat(ai): atomic upsert_document on the LanceDB store
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:05 -07:00
stumpylog 5cdd9faa56 docs(plan): add Task 13 — pass new AI code through pyrefly
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:05 -07:00
stumpylog 0054f5946b refactor(ai): address review on the LanceDB adapter
- Fix delete() to use single-quote delimiter consistent with _escape
- Fix _distance comment: L2 not squared-L2
- Fix similarity_top_k zero-guard to use explicit None check
- Replace deprecated table_names() with list_tables().tables (lancedb 0.33)
- Add add() Sequence[BaseNode] signature with collections.abc.Sequence import
- Add test_build_where_or_condition for OR filter branch coverage

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:05 -07:00
stumpylog b758cd1bdb feat(ai): add LanceDB-backed vector store adapter
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:05 -07:00
stumpylog df4607a492 build: replace faiss-cpu with lancedb for the AI vector store
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:04 -07:00
stumpylog 69ca36a16d Design: Implementation plan for the LanceDB vector store
Task-by-task TDD plan implementing the LanceDB design spec: dependency
swap, the PaperlessLanceVectorStore adapter, atomic merge_insert upsert,
ANN threshold + scalar index + compaction, the indexing/chat/similar
rewires, FAISS migration, and a lazy-import guard test so non-AI paths
(management commands) never drag in llama_index/lancedb/pyarrow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:03 -07:00
stumpylog c9ee9edb95 Design: Replace FAISS vector store with LanceDB (custom adapter)
Spec for swapping the AI feature's llama-index FAISS StorageContext trio
(FaissVectorStore + SimpleDocumentStore + SimpleIndexStore) for LanceDB via
a custom BasePydanticVectorStore adapter (no llama-index-vector-stores-lancedb,
no pandas).

Covers: disk-resident memory-mapped storage, native merge_insert upsert with
when_not_matched_by_source_delete, MetadataFilters(IN) filtering on a top-level
document_id column, auto IVF ANN threshold (IVF_FLAT fallback), MVCC compaction
via optimize(cleanup_older_than=...), migration, concurrency, and testing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:41:03 -07:00
Trenton H abdcdccf08 Chore(deps): Silence a couple more vulnerabilities here (#12797) 2026-06-03 09:28:00 -07:00
shamoon 1663ed170c Enhancement (beta): add direct LLM language setting (#12906) 2026-06-03 15:53:22 +00:00
shamoon 47a6fcfc39 Fix (beta): correctly apply i18n in suggestions dropdown (#12905) 2026-06-03 08:40:06 -07:00
Trenton H 98dc191194 Fix: Lock AI index during reading and don't index documents many times during a bulk update (#12899)
* Fix: Move LLM index lock outside index dir and skip per-doc tasks on bulk update

Two concurrency bugs from #12893:

[P1] Lock file lived inside LLM_INDEX_DIR. A rebuild calls
shutil.rmtree(LLM_INDEX_DIR), deleting the lock while a worker still
held it. A second worker then acquired a fresh lock on the new path and
ran concurrently, defeating serialisation. Move the lock to
DATA_DIR/locks/llm_index.lock (a new settings constant LLM_INDEX_LOCK)
so rmtree cannot touch it. The locks/ dir is created at settings load
time, matching the existing pattern for LOGGING_DIR.

[P2] document_updated was connected to add_or_update_document_in_llm_index
in apps.py. bulk_update_documents() emits document_updated for every
document in the batch, queuing N per-document LLM tasks, and then also
calls update_llm_index(rebuild=False) once at the end. Pass
skip_ai_index=True when sending document_updated from the bulk path so
the handler skips the per-document enqueue; the existing batch call at
the end of bulk_update_documents is the only LLM update for that path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: ghost vectors leave KeyError-prone nodes_dict entries after deletion

docstore.delete_document() removes a node from the docstore but leaves its
entry in index_struct.nodes_dict (the FAISS positional-id to node-UUID map).
A subsequent similarity query resolves the ghost position to the deleted UUID,
finds nothing in fetched_nodes_by_id, and raises KeyError inside
_insert_fetched_nodes_into_query_result.

Purge stale nodes_dict entries after each docstore deletion and re-sync the
mutated index_struct into the kvstore so persist() writes the updated mapping.
Dead FAISS vectors remain in the flat index until the next full rebuild
(IndexFlatL2 is append-only); add a try/except KeyError around
retriever.retrieve() as a defensive fallback for any residual ghost positions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: acquire index lock in query_similar_documents

query_similar_documents() loaded the index and ran the FAISS retriever
without holding the file lock. All write paths (update_llm_index,
llm_index_add_or_update_document, llm_index_remove_document) hold
FileLock(_index_lock_path()), so a concurrent rebuild calling
shutil.rmtree(LLM_INDEX_DIR) while a read is mid-load produces an IOError
or corrupt partial state.

Wrap the load_or_build_index() call and all subsequent retriever work inside
FileLock. The early-return guards (vector_store_file_exists check, empty
allowed_document_ids) remain outside the lock; the DB query for the final
result set also stays outside.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: skip LLM index enqueue on document_updated during version addition

When a document is consumed as a new version of an existing document, the
consumer fires document_consumption_finished (which triggers
add_or_update_document_in_llm_index) and then document_updated for the root
document. Both signals are connected to the same handler, so the root document
was enqueued for LLM indexing twice per version-addition event.

Pass skip_ai_index=True on the consumer's version-addition document_updated
send so the handler's existing guard suppresses the duplicate enqueue.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Test: bulk_update_documents must not enqueue per-doc LLM tasks

With AI enabled, bulk_update_documents() sends document_updated for every
document in the batch. The skip_ai_index=True kwarg (added in the P2 fix)
prevents add_or_update_document_in_llm_index from enqueuing a per-document
task for each one. Only the single update_llm_index call at the end should run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Debug level log sure

* Update src/paperless_ai/indexing.py

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>

* Apply suggestion from @shamoon

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-06-02 10:46:29 -07:00
Trenton H 2c58d86380 Fix: Minor fixes for the AI indexing (#12893)
* Fix: Remove all nodes for multi-chunk documents in update_llm_index incremental path

The existing_nodes dict comprehension keyed on document_id silently dropped all
but the last node per document, so only that one node was deleted when a
modified document was re-indexed, leaving all other chunks as ghost vectors in
the FAISS index. Switch to a defaultdict(list) that collects every node per
document_id, then iterate and delete all of them before inserting fresh nodes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Wire document_updated signal to LLM index update handler

Connect document_updated to add_or_update_document_in_llm_index in
DocumentsConfig.ready() so REST API edits (PATCH /api/documents/{id}/)
enqueue an LLM vector store update, matching the existing
document_consumption_finished behavior.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Add file lock around FAISS index mutations to prevent concurrent write corruption

Two concurrent Celery workers calling llm_index_add_or_update_document or
llm_index_remove_document each loaded the same on-disk index independently,
made their own change, and the last writer silently overwrote the first's
update. Wrap both functions and the rebuild/persist body of update_llm_index
in a filelock.FileLock keyed on LLM_INDEX_DIR/index.lock. Add a TOCTOU
comment on queue_llm_index_update_if_needed explaining the residual risk
(duplicate rebuild tasks are wasteful but not corrupting because the lock
serialises the actual write).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Apply _normalize() in extract_unmatched_names to prevent duplicate suggestions

extract_unmatched_names was using .lower() while _match_names_to_queryset
uses _normalize() (which also strips punctuation). A name like "J. Smith"
matched to existing correspondent "J Smith" would still appear in the
unmatched list, causing duplicate object creation.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Skip LLM index update gracefully when document has no indexable content

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Persist empty index when all documents are deleted to clear stale FAISS vectors

The early-return guard in update_llm_index fired before persist() when no
documents existed, leaving a stale on-disk FAISS index that returned phantom
hits for deleted document IDs. Now the guard only returns early for the
incremental (rebuild=False) path when no index exists on disk; the rebuild
path always continues through to persist(), producing an empty clean index.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Chore: Simplify incremental index update — use docs.values() and deduplicate node extend

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 13:40:49 -07:00
shamoon 52222d23d3 Fix (beta): dont use tool calling with ollama (#12896) 2026-06-01 12:12:23 -07:00
shamoon 27426c04b0 Enhancement: try to respect language for AI suggestions (#12894) 2026-06-01 12:11:46 -07:00
shamoon f6c865bf47 Enhancement: AI LLM chunk size and context window config (#12891) 2026-06-01 17:56:21 +00:00
Trenton H bb860a5834 Fix: Improvements for security around the AI (#12895)
* Fix: Validate and limit chat question input in ChatStreamingView

Add max_length=4000 to ChatStreamingSerializer.q and replace the bare
request.data["q"] read with proper serializer.is_valid(raise_exception=True)
so oversized or missing questions are rejected with HTTP 400 before
reaching the LLM.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: Add defensive prompt framing to mark document content as untrusted

* Also adds a system prompt which is treated higher that this is untrusted stuff

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-06-01 10:03:27 -07:00
Trenton H 889ccfd67a Fix: Fold query and autocomplete terms with Tantivy's ascii_fold so special letters match (#12868) 2026-05-29 16:42:07 -07:00
Trenton H bbceb5dac6 Fix: Don't store autocomplete_word, only index it (#12867) 2026-05-29 14:09:04 -07:00
Trenton H 98a7ed32e3 Fix: Preserve Whoosh date range swapping in Tantviy (#12866) 2026-05-29 20:21:59 +00:00
Trenton H 25a7b2038a Fix: Always release search index writer, even on failure, so the write lock doesn't persist for later (#12865) 2026-05-29 19:38:58 +00:00
Trenton H 97e3c75720 Fix: Handle CJK title, content and metadata searching (#12862) 2026-05-29 19:11:55 +00:00
Trenton H 11c62757ef Fix: Restrict date query rewrites to date or datetime fields only (#12864) 2026-05-29 11:59:30 -07:00
Trenton H 4a8d79be6f Fix: Missing call to tanvity wait_merging_threads (#12863) 2026-05-29 10:32:15 -07:00
Trenton H 525b986e23 Fix: Handle tanvity index lock contention (#12856)
Implements and tests a retry with backoff + jitter for aquring the index update lock.  If we still can't get it, dispatch a celery task to handle it later instead (also with retry)

Signed-off-by: stumpylog <797416+stumpylog@users.noreply.github.com>
2026-05-27 09:47:13 -07:00
shamoon 4ce5f2022c Fix (beta): better catch chat errors (#12854) 2026-05-26 19:05:47 +00:00
shamoon ab47185712 Performance (beta): dont re-build vector index with each chat (#12847) 2026-05-26 11:36:05 -07:00
shamoon 01d8fad622 Security: fixes for v3 beta (#12838) 2026-05-26 16:46:23 +00:00
shamoon da3e845b8b Fix (beta): normalize long punctuation chunks to improve embedding (#12848) 2026-05-26 09:32:38 -07:00
shamoon 0a6e0db186 Fix: use chord.on_error before apply_async (#12842) 2026-05-24 14:42:11 -07:00
shamoon 15682231b2 Chore: fix sonarcube logger warnings 2026-05-20 08:54:00 -07:00
Trenton H df861189fa Fix: Don't use smaller integer fields for some workflow fields (#12834) 2026-05-20 14:39:01 +00:00
Trenton H bd86dca57e Fix: Password removal source file location (#12830)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-05-19 13:52:04 -07:00
Trenton H 9f45737b94 Upgrades this dep so it handles newer models, like gpt-5-5 which require a locked 1.0 temperature value (#12824) 2026-05-18 12:30:03 -07:00
shamoon 83d59ad3bf Fix (beta): use correct html button type for custom field buttons (#12819) 2026-05-17 19:15:03 -07:00
Trenton H ff3360310b Fix: Defer password removal workflow action until the file is in place (#12814) 2026-05-16 17:14:37 -07:00
Trenton H 9a68dcdddf Fix: Allow setting allauth rate limit configuration settings (#12798) 2026-05-14 07:29:49 -07:00
Trenton H 9a78882b5a Fix: Don't embed the metadata which is already embedded into the context (#12795) 2026-05-13 09:01:34 -07:00
Trenton H 7e381f204e Fix: Sanitize dash or plus from the text search path (#12789) 2026-05-12 12:41:38 -07:00
shamoon 5f42854d99 Fix: two more css tweaks to tasks page 2026-05-11 13:50:02 -07:00