paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-08-03 01:22:17 +00:00

Author	SHA1	Message	Date
stumpylog and Claude Opus 4.8	788ae5d4e5	refactor(ai): chat uses a stock filtered retriever Delete _get_document_filtered_retriever (74-line custom FAISS retriever with expanding top_k loop) and rewrite _stream_chat_with_documents to use a stock VectorIndexRetriever with MetadataFilters(IN). The no-content pre-check now calls index.vector_store.get_nodes(filters=...) which returns [] cleanly for un-indexed documents. Move FakeEmbedding and mock_embed_model fixture to conftest.py so both test_chat.py and test_ai_indexing.py share them. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:08 -07:00
stumpylog and Claude Opus 4.8	d0a7c47f92	feat(ai): dimension guard and FAISS index migration Drops migrate_stale_faiss_index (users delete llm_index/ manually on upgrade). Keeps embedding_dim_mismatch to force a rebuild when the model changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:08 -07:00
stumpylog and Claude Opus 4.8	d70b08296c	feat(ai): dimension guard and FAISS index migration Adds current_embedding_dim() to embedding.py, migrate_stale_faiss_index() and embedding_dim_mismatch() to indexing.py, and wires both into update_llm_index so that stale FAISS directories are wiped on startup and embedding model changes force a full index rebuild. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:07 -07:00
stumpylog and Claude Opus 4.8	d9b2c4fa86	refactor(ai): query_similar_documents via metadata filter Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:07 -07:00
stumpylog and Claude Opus 4.8	d641925c4d	refactor(ai): group new LanceDB indexing tests in a class Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:07 -07:00
stumpylog and Claude Opus 4.8	e9c3f04b8b	refactor(ai): build the index from the LanceDB store alone (lazy import) Replace get_or_create_storage_context with get_vector_store() (lazy import of paperless_ai.vector_store inside the function), rewrite load_or_build_index to use VectorStoreIndex.from_vector_store, and rewrite vector_store_file_exists to use store.table_exists(). Add LLM_INDEX_TABLE constant and TYPE_CHECKING-only import of PaperlessLanceVectorStore. Delete remove_document_docstore_nodes and rewire llm_index_add_or_update_document, llm_index_remove_document, and update_llm_index to use upsert_document/delete/drop_table on the LanceDB store. Serialize tags list as JSON string to satisfy flat_metadata validation. Add test_get_vector_store_roundtrip, test_add_then_remove_document, test_update_shrinks_chunks_without_orphans, and the subprocess lazy-import guard. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:06 -07:00
stumpylog and Claude Opus 4.8	9a40b4ac9d	feat(ai): tie LlamaDocument id to the paperless document id Set id_=str(document.id) on the LlamaDocument constructor in build_document_node so that every chunk node's ref_doc_id equals the paperless document pk, enabling the LanceDB adapter's delete(str(doc.id)) and doc_id column to work correctly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:06 -07:00
stumpylog and Claude Opus 4.8	b2e0dbef46	refactor(ai): drop version-defensive vector-index check (lancedb is pinned) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:06 -07:00
stumpylog and Claude Opus 4.8	98a5a583f3	refactor(ai): log when the vector-index check fails Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:06 -07:00
stumpylog and Claude Opus 4.8	0421bfcf54	feat(ai): ANN index threshold, scalar index, and compaction Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:06 -07:00
stumpylog and Claude Opus 4.8	fa74bb77b3	fix(ai): upsert empty-nodes path deletes by document_id When upsert_document receives an empty nodes list, delete existing chunks using the document_id column directly (consistent with the merge_insert prune predicate) rather than calling delete() which filters on doc_id. Guard for a missing table to avoid a no-op. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:05 -07:00
stumpylog and Claude Opus 4.8	f0311e77d4	feat(ai): atomic upsert_document on the LanceDB store Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:05 -07:00
stumpylog and Claude Opus 4.8	5cdd9faa56	docs(plan): add Task 13 — pass new AI code through pyrefly Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:05 -07:00
stumpylog and Claude Opus 4.8	0054f5946b	refactor(ai): address review on the LanceDB adapter - Fix delete() to use single-quote delimiter consistent with _escape - Fix _distance comment: L2 not squared-L2 - Fix similarity_top_k zero-guard to use explicit None check - Replace deprecated table_names() with list_tables().tables (lancedb 0.33) - Add add() Sequence[BaseNode] signature with collections.abc.Sequence import - Add test_build_where_or_condition for OR filter branch coverage Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:05 -07:00
stumpylog and Claude Opus 4.8	b758cd1bdb	feat(ai): add LanceDB-backed vector store adapter Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:05 -07:00
stumpylog and Claude Opus 4.8	df4607a492	build: replace faiss-cpu with lancedb for the AI vector store Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:04 -07:00
stumpylog and Claude Opus 4.8	69ca36a16d	Design: Implementation plan for the LanceDB vector store Task-by-task TDD plan implementing the LanceDB design spec: dependency swap, the PaperlessLanceVectorStore adapter, atomic merge_insert upsert, ANN threshold + scalar index + compaction, the indexing/chat/similar rewires, FAISS migration, and a lazy-import guard test so non-AI paths (management commands) never drag in llama_index/lancedb/pyarrow. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:03 -07:00
stumpylog and Claude Opus 4.8	c9ee9edb95	Design: Replace FAISS vector store with LanceDB (custom adapter) Spec for swapping the AI feature's llama-index FAISS StorageContext trio (FaissVectorStore + SimpleDocumentStore + SimpleIndexStore) for LanceDB via a custom BasePydanticVectorStore adapter (no llama-index-vector-stores-lancedb, no pandas). Covers: disk-resident memory-mapped storage, native merge_insert upsert with when_not_matched_by_source_delete, MetadataFilters(IN) filtering on a top-level document_id column, auto IVF ANN threshold (IVF_FLAT fallback), MVCC compaction via optimize(cleanup_older_than=...), migration, concurrency, and testing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 09:41:03 -07:00
Trenton H and GitHub	abdcdccf08	Chore(deps): Silence a couple more vulnerabilities here (#12797)	2026-06-03 09:28:00 -07:00
shamoon and GitHub	1663ed170c	Enhancement (beta): add direct LLM language setting (#12906)	2026-06-03 15:53:22 +00:00
shamoon and GitHub	47a6fcfc39	Fix (beta): correctly apply i18n in suggestions dropdown (#12905)	2026-06-03 08:40:06 -07:00
Trenton H GitHub Claude Sonnet 4.6 shamoon	98dc191194	Fix: Lock AI index during reading and don't index documents many times during a bulk update (#12899 ) * Fix: Move LLM index lock outside index dir and skip per-doc tasks on bulk update Two concurrency bugs from #12893: [P1] Lock file lived inside LLM_INDEX_DIR. A rebuild calls shutil.rmtree(LLM_INDEX_DIR), deleting the lock while a worker still held it. A second worker then acquired a fresh lock on the new path and ran concurrently, defeating serialisation. Move the lock to DATA_DIR/locks/llm_index.lock (a new settings constant LLM_INDEX_LOCK) so rmtree cannot touch it. The locks/ dir is created at settings load time, matching the existing pattern for LOGGING_DIR. [P2] document_updated was connected to add_or_update_document_in_llm_index in apps.py. bulk_update_documents() emits document_updated for every document in the batch, queuing N per-document LLM tasks, and then also calls update_llm_index(rebuild=False) once at the end. Pass skip_ai_index=True when sending document_updated from the bulk path so the handler skips the per-document enqueue; the existing batch call at the end of bulk_update_documents is the only LLM update for that path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: ghost vectors leave KeyError-prone nodes_dict entries after deletion docstore.delete_document() removes a node from the docstore but leaves its entry in index_struct.nodes_dict (the FAISS positional-id to node-UUID map). A subsequent similarity query resolves the ghost position to the deleted UUID, finds nothing in fetched_nodes_by_id, and raises KeyError inside _insert_fetched_nodes_into_query_result. Purge stale nodes_dict entries after each docstore deletion and re-sync the mutated index_struct into the kvstore so persist() writes the updated mapping. 
Dead FAISS vectors remain in the flat index until the next full rebuild (IndexFlatL2 is append-only); add a try/except KeyError around retriever.retrieve() as a defensive fallback for any residual ghost positions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: acquire index lock in query_similar_documents query_similar_documents() loaded the index and ran the FAISS retriever without holding the file lock. All write paths (update_llm_index, llm_index_add_or_update_document, llm_index_remove_document) hold FileLock(_index_lock_path()), so a concurrent rebuild calling shutil.rmtree(LLM_INDEX_DIR) while a read is mid-load produces an IOError or corrupt partial state. Wrap the load_or_build_index() call and all subsequent retriever work inside FileLock. The early-return guards (vector_store_file_exists check, empty allowed_document_ids) remain outside the lock; the DB query for the final result set also stays outside. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: skip LLM index enqueue on document_updated during version addition When a document is consumed as a new version of an existing document, the consumer fires document_consumption_finished (which triggers add_or_update_document_in_llm_index) and then document_updated for the root document. Both signals are connected to the same handler, so the root document was enqueued for LLM indexing twice per version-addition event. Pass skip_ai_index=True on the consumer's version-addition document_updated send so the handler's existing guard suppresses the duplicate enqueue. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Test: bulk_update_documents must not enqueue per-doc LLM tasks With AI enabled, bulk_update_documents() sends document_updated for every document in the batch. The skip_ai_index=True kwarg (added in the P2 fix) prevents add_or_update_document_in_llm_index from enqueuing a per-document task for each one. 
Only the single update_llm_index call at the end should run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Debug level log sure * Update src/paperless_ai/indexing.py Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com> * Apply suggestion from @shamoon --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-06-02 10:46:29 -07:00
Trenton H GitHub Claude Sonnet 4.6	2c58d86380	Fix: Minor fixes for the AI indexing (#12893 ) * Fix: Remove all nodes for multi-chunk documents in update_llm_index incremental path The existing_nodes dict comprehension keyed on document_id silently dropped all but the last node per document, so only that one node was deleted when a modified document was re-indexed, leaving all other chunks as ghost vectors in the FAISS index. Switch to a defaultdict(list) that collects every node per document_id, then iterate and delete all of them before inserting fresh nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Wire document_updated signal to LLM index update handler Connect document_updated to add_or_update_document_in_llm_index in DocumentsConfig.ready() so REST API edits (PATCH /api/documents/{id}/) enqueue an LLM vector store update, matching the existing document_consumption_finished behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Add file lock around FAISS index mutations to prevent concurrent write corruption Two concurrent Celery workers calling llm_index_add_or_update_document or llm_index_remove_document each loaded the same on-disk index independently, made their own change, and the last writer silently overwrote the first's update. Wrap both functions and the rebuild/persist body of update_llm_index in a filelock.FileLock keyed on LLM_INDEX_DIR/index.lock. Add a TOCTOU comment on queue_llm_index_update_if_needed explaining the residual risk (duplicate rebuild tasks are wasteful but not corrupting because the lock serialises the actual write). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Apply _normalize() in extract_unmatched_names to prevent duplicate suggestions extract_unmatched_names was using .lower() while _match_names_to_queryset uses _normalize() (which also strips punctuation). A name like "J. 
Smith" matched to existing correspondent "J Smith" would still appear in the unmatched list, causing duplicate object creation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Skip LLM index update gracefully when document has no indexable content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Persist empty index when all documents are deleted to clear stale FAISS vectors The early-return guard in update_llm_index fired before persist() when no documents existed, leaving a stale on-disk FAISS index that returned phantom hits for deleted document IDs. Now the guard only returns early for the incremental (rebuild=False) path when no index exists on disk; the rebuild path always continues through to persist(), producing an empty clean index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore: Simplify incremental index update — use docs.values() and deduplicate node extend --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 13:40:49 -07:00
shamoon and GitHub	52222d23d3	Fix (beta): don't use tool calling with ollama (#12896)	2026-06-01 12:12:23 -07:00
shamoon and GitHub	27426c04b0	Enhancement: try to respect language for AI suggestions (#12894)	2026-06-01 12:11:46 -07:00
shamoon and GitHub	f6c865bf47	Enhancement: AI LLM chunk size and context window config (#12891)	2026-06-01 17:56:21 +00:00
Trenton H GitHub Claude Sonnet 4.6	bb860a5834	Fix: Improvements for security around the AI (#12895 ) * Fix: Validate and limit chat question input in ChatStreamingView Add max_length=4000 to ChatStreamingSerializer.q and replace the bare request.data["q"] read with proper serializer.is_valid(raise_exception=True) so oversized or missing questions are rejected with HTTP 400 before reaching the LLM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Add defensive prompt framing to mark document content as untrusted * Also adds a system prompt which is treated higher that this is untrusted stuff --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 10:03:27 -07:00
Trenton H and GitHub	889ccfd67a	Fix: Fold query and autocomplete terms with Tantivy's ascii_fold so special letters match (#12868)	2026-05-29 16:42:07 -07:00
Trenton H and GitHub	bbceb5dac6	Fix: Don't store autocomplete_word, only index it (#12867)	2026-05-29 14:09:04 -07:00
Trenton H and GitHub	98a7ed32e3	Fix: Preserve Whoosh date range swapping in Tantivy (#12866)	2026-05-29 20:21:59 +00:00
Trenton H and GitHub	25a7b2038a	Fix: Always release search index writer, even on failure, so the write lock doesn't persist for later (#12865)	2026-05-29 19:38:58 +00:00
Trenton H and GitHub	97e3c75720	Fix: Handle CJK title, content and metadata searching (#12862)	2026-05-29 19:11:55 +00:00
Trenton H and GitHub	11c62757ef	Fix: Restrict date query rewrites to date or datetime fields only (#12864)	2026-05-29 11:59:30 -07:00
Trenton H and GitHub	4a8d79be6f	Fix: Missing call to Tantivy wait_merging_threads (#12863)	2026-05-29 10:32:15 -07:00
Trenton H and GitHub	525b986e23	Fix: Handle Tantivy index lock contention (#12856) Implements and tests a retry with backoff + jitter for acquiring the index update lock. If we still can't get it, dispatch a celery task to handle it later instead (also with retry) Signed-off-by: stumpylog <797416+stumpylog@users.noreply.github.com>	2026-05-27 09:47:13 -07:00
shamoon and GitHub	4ce5f2022c	Fix (beta): better catch chat errors (#12854)	2026-05-26 19:05:47 +00:00
shamoon and GitHub	ab47185712	Performance (beta): don't re-build vector index with each chat (#12847)	2026-05-26 11:36:05 -07:00
shamoon and GitHub	01d8fad622	Security: fixes for v3 beta (#12838)	2026-05-26 16:46:23 +00:00
shamoon and GitHub	da3e845b8b	Fix (beta): normalize long punctuation chunks to improve embedding (#12848)	2026-05-26 09:32:38 -07:00
shamoon and GitHub	0a6e0db186	Fix: use chord.on_error before apply_async (#12842)	2026-05-24 14:42:11 -07:00
shamoon	15682231b2	Chore: fix SonarQube logger warnings	2026-05-20 08:54:00 -07:00
Trenton H and GitHub	df861189fa	Fix: Don't use smaller integer fields for some workflow fields (#12834)	2026-05-20 14:39:01 +00:00
Trenton H GitHub shamoon	bd86dca57e	Fix: Password removal source file location (#12830 ) Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-05-19 13:52:04 -07:00
Trenton H and GitHub	9f45737b94	Upgrades this dep so it handles newer models, like gpt-5-5 which require a locked 1.0 temperature value (#12824)	2026-05-18 12:30:03 -07:00
shamoon and GitHub	83d59ad3bf	Fix (beta): use correct html button type for custom field buttons (#12819)	2026-05-17 19:15:03 -07:00
Trenton H and GitHub	ff3360310b	Fix: Defer password removal workflow action until the file is in place (#12814)	2026-05-16 17:14:37 -07:00
Trenton H and GitHub	9a68dcdddf	Fix: Allow setting allauth rate limit configuration settings (#12798)	2026-05-14 07:29:49 -07:00
Trenton H and GitHub	9a78882b5a	Fix: Don't embed the metadata which is already embedded into the context (#12795)	2026-05-13 09:01:34 -07:00
Trenton H and GitHub	7e381f204e	Fix: Sanitize dash or plus from the text search path (#12789)	2026-05-12 12:41:38 -07:00
shamoon	5f42854d99	Fix: two more css tweaks to tasks page	2026-05-11 13:50:02 -07:00
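
Note on commit 2c58d86380: its message says a dict comprehension keyed on document_id kept only the last node per document, and that the fix collects every node in a defaultdict(list). The sketch below illustrates that difference only. The Node class, the sample data, and the function name are hypothetical and are not taken from the paperless-ngx source.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for an index chunk node."""

    node_id: str
    document_id: int


# Document 1 was split into two chunks, document 2 into one.
nodes = [Node("a", 1), Node("b", 1), Node("c", 2)]

# Pattern the commit message describes as buggy: later nodes with the same
# key overwrite earlier ones, so only one chunk per document survives.
last_only = {n.document_id: n for n in nodes}


def group_nodes_by_document(all_nodes):
    """Collect every node under its document_id so all chunks can be removed."""
    grouped = defaultdict(list)
    for n in all_nodes:
        grouped[n.document_id].append(n)
    return grouped


grouped = group_nodes_by_document(nodes)
```

With the grouped form, a re-index step can iterate over every chunk of a modified document before inserting fresh nodes, which is the behavior the commit message describes.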
