paperless-ngx

mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-06-21 12:54:22 +00:00

Author	SHA1	Message	Date
shamoon	bcf5d2cffc	Chore: set tool_required to openai-like llm calls (#13025 )	2026-06-17 06:24:38 -07:00
Trenton H	ad1b54ce88	Fix (beta): Catch consumer files created during watcher re-creations (#13013 )	2026-06-15 19:23:54 -07:00
Trenton H	f4fa916579	Fix (beta): restore v2 (Whoosh) advanced-search query compatibility (#13010 ) Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-15 15:32:44 -07:00
shamoon	75f0c4c92e	Fix (beta): retry celery ping and report warning on no response (#13012 )	2026-06-15 15:05:43 -07:00
Trenton H	a020f64d08	Enhancement(beta): replace LanceDB vector store with sqlite-vec (#12990 ) * Chore(beta): add sqlite-vec 0.1.9 dependency Pinned exactly: the 0.1.9 wheels carry no baked SIMD flags (safe on pre-AVX2 CPUs, the point of this migration); the 0.1.10 alphas bake -mavx and would reintroduce the #12970 crash class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Test(beta): port vector store tests to sqlite-vec backend Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Enhancement(beta): switch AI vector store from LanceDB to sqlite-vec Fixes the non-AVX2 SIGILL class (#12970) at the root: lancedb is no longer imported. sqlite-vec 0.1.9 wheels carry no baked SIMD, vec0 metadata columns give parameterized EQ/IN filtering, WAL preserves the lock-free-reader model, and compact() rebuilds the table because vec0 DELETEs never reclaim space. Implementation notes vs. the Task 3A draft: - compact() uses a file-swap approach (new db file + Path.replace) rather than ALTER TABLE RENAME, which does not cascade to shadow tables in sqlite-vec 0.1.9 (upstream limitation). - Bloat is tracked via a cumulative total_inserts counter in index_meta because the _rowids shadow table does not accumulate deleted rows in 0.1.9 (contrary to the design doc assumption from #54). - None distances from the zero-vector cosine edge case are mapped to similarity 0.0 rather than raising TypeError. - Test suite updated accordingly: _bloat_ratio reads index_meta instead of _rowids; seed collision in force-compact test fixed (seed=100.0). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Enhancement(beta): wire indexing pipeline to the sqlite-vec store Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Enhancement(beta): move filename/storage path/ASN to node metadata Same treatment as title/tags/correspondent in #12944: excluded from the embedded text, visible to the LLM via metadata prepend. 
Changes embedded text for every document, so it ships inside the sqlite-vec transition, whose forced rebuild re-embeds everything anyway. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Test(beta): cover legacy LanceDB index cleanup and forced rebuild Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore(beta): drop lancedb dependency Fixes #12970: the package whose wheels SIGILL on non-AVX2 CPUs is no longer installed at all. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore(beta): partial pyrefly cleanup on sqlite-vec vector store - Add MetadataFilter import and isinstance guard in _build_where() - Add query_embedding None guard in query() - Fix dict.get() type-checker ambiguity in get_configured_model_name() Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore(beta): drop automatic LanceDB index cleanup on startup Leave legacy Lance directory removal to the user rather than deleting it automatically on first run. Beta policy: user is expected to do a clean re-embed anyway; no need for the system to silently delete their data. Remove _cleanup_legacy_lance_index(), the forced-rebuild path that called it, and the associated tests. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore(beta): ruff format pass on sqlite-vec AI files Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Removes the benchmarking file * Try to resolve or silence some semgrep. But we're using SQL here, not an ORM and we control the inputs, not users * Enhancement(beta): add schema migration machinery to sqlite-vec vector store Adds versioned schema migration support modelled after PR #12968's LanceDB approach, adapted for sqlite-vec's file-swap compaction pattern. 
- SCHEMA_VERSION = 1 written to index_meta at table creation and preserved through compact() - Migration dataclass with from_version, to_version, kind ("structural" or "re-embed"), description, and an optional apply(src, dst, dim) callable - MIGRATIONS registry (empty at v1 baseline); add entries and bump SCHEMA_VERSION when the schema changes - check_and_run_migrations(): structural migrations run via the same file-swap as compact() (no re-embed); re-embed migrations return True so the caller forces a full rebuild - update_llm_index() calls check_and_run_migrations() under the write lock before any indexing work Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore(beta): deduplicate vector store internals via helper methods Extract three helpers to remove copy-paste between compact() and _run_structural_migration(): - _meta_set_on(conn, key, value): static upsert into any connection's index_meta; _meta_set() now delegates to it - _create_vec_table(conn, dim): CREATE VIRTUAL TABLE DDL (carries the nosemgrep annotation) - _swap_in_compact(compact_path, db_path): close/replace/reconnect sequence used by both file-swap callers Also normalises compact() error-path cleanup to unlink(missing_ok=True). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Adds equality test and no covers some defensive error handling stuff * Ensures an embed migration stops the migration chain, just in case * Silence one kind right but not really semgrep * Trims dead assignment * Fix(beta): address Copilot review on sqlite-vec vector store Three findings from the PR review: - compact() failure cleanup now unlinks the temporary .compact-wal and .compact-shm files, matching _run_structural_migration(); previously only the main .compact file was removed. - _build_where() fails closed (1 = 0) when filters are requested but none translate, instead of emitting "()" which is invalid SQL; filters scope document access, so an empty translation must match no rows. 
- Drop the unused table_name constructor parameter (all SQL hardcodes DEFAULT_TABLE_NAME) and its callers in indexing.py. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Enhancement(beta): guard sqlite-vec compaction swap against concurrent readers The compaction/migration file swap replaces the database via os.replace, but the -wal/-shm files are keyed by path, not inode. A reader holding an open connection across the swap leaves the old WAL aliased onto the new file; a subsequent write then corrupts the database (reproduced via PRAGMA integrity_check). Add a cross-process read/write lock (filelock.ReadWriteLock) over the index: - read_store() holds it shared for the whole connection lifetime (and closes the connection on exit); concurrent readers do not block. - compaction and the migration check run under an exclusive lock that drains readers, and skip with an info log on Timeout (maintenance op, retries next run). - Normal writes are untouched: WAL gives reader/writer concurrency and LLM_INDEX_LOCK still serializes writers, so they never block readers. load_or_build_index() now takes the store from the caller's read_store() so the lock and connection span the whole retrieval; chat holds it across the streamed response. Two new settings: LLM_INDEX_RWLOCK and LLM_INDEX_COMPACTION_LOCK_TIMEOUT. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> * Ensures the store always cleans up SQLite connections for any operations, even on errors --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-15 13:20:41 -07:00
Trenton H	8ed4bf2011	Fix: Apply unicode normalization to all paths and path components (#12993 )	2026-06-13 12:45:54 +00:00
Trenton H	92c016ce47	Fix: Handle UTF-16 and BOM text files better (#12994 )	2026-06-13 05:35:38 -07:00
shamoon	fb3816486c	Fix (beta): avoid DRF update calling `save` on all fields (#12992 )	2026-06-12 11:14:26 -07:00
Trenton H	4394403beb	Fix: release pooled DB connection during AI LLM/embedding calls (#12983 )	2026-06-11 13:07:31 -07:00
Trenton H	f188d308eb	Fix: health-check pooled DB connections and close the pool on worker shutdown (#12977 )	2026-06-11 05:49:10 -07:00
shamoon	c3459d8f62	Fix (beta): move task filtering to backend fully (#12956 )	2026-06-07 22:45:15 +00:00
shamoon	6f8e39c2e0	Fix: avoid unnecessarily creating a new PDF with the password removal workflow (#12948 )	2026-06-07 20:30:08 +00:00
Trenton H	eb292baa69	Enhancement (beta): Switch the AI vector store to LanceDB (#12944 ) Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: shamoon <shamoon@users.noreply.github.com>	2026-06-07 11:31:26 -07:00
shamoon	3d0b8343b9	Fixhancement (beta): tasks dismiss all (#12949 )	2026-06-07 03:42:06 +00:00
shamoon	449fd97b1f	Fix (beta): respect disable state for suggest endpoint, require change perms (#12942 )	2026-06-05 14:16:53 +00:00
Trenton H	fa0c4368d7	Fix: Ensure checksum comparison is using SHA256 in file handling (#12939 )	2026-06-05 06:46:45 -07:00
shamoon	289d797837	Merge branch 'dev' into beta	2026-06-03 15:12:44 -07:00
Trenton H	7ef6ba69e6	Fix: Validate the AI backend settings earlier instead of crashing inside the AI module (#12903 )	2026-06-03 12:16:09 -07:00
shamoon	1663ed170c	Enhancement (beta): add direct LLM language setting (#12906 )	2026-06-03 15:53:22 +00:00
Trenton H	98dc191194	Fix: Lock AI index during reading and don't index documents many times during a bulk update (#12899 ) * Fix: Move LLM index lock outside index dir and skip per-doc tasks on bulk update Two concurrency bugs from #12893: [P1] Lock file lived inside LLM_INDEX_DIR. A rebuild calls shutil.rmtree(LLM_INDEX_DIR), deleting the lock while a worker still held it. A second worker then acquired a fresh lock on the new path and ran concurrently, defeating serialisation. Move the lock to DATA_DIR/locks/llm_index.lock (a new settings constant LLM_INDEX_LOCK) so rmtree cannot touch it. The locks/ dir is created at settings load time, matching the existing pattern for LOGGING_DIR. [P2] document_updated was connected to add_or_update_document_in_llm_index in apps.py. bulk_update_documents() emits document_updated for every document in the batch, queuing N per-document LLM tasks, and then also calls update_llm_index(rebuild=False) once at the end. Pass skip_ai_index=True when sending document_updated from the bulk path so the handler skips the per-document enqueue; the existing batch call at the end of bulk_update_documents is the only LLM update for that path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: ghost vectors leave KeyError-prone nodes_dict entries after deletion docstore.delete_document() removes a node from the docstore but leaves its entry in index_struct.nodes_dict (the FAISS positional-id to node-UUID map). A subsequent similarity query resolves the ghost position to the deleted UUID, finds nothing in fetched_nodes_by_id, and raises KeyError inside _insert_fetched_nodes_into_query_result. Purge stale nodes_dict entries after each docstore deletion and re-sync the mutated index_struct into the kvstore so persist() writes the updated mapping. 
Dead FAISS vectors remain in the flat index until the next full rebuild (IndexFlatL2 is append-only); add a try/except KeyError around retriever.retrieve() as a defensive fallback for any residual ghost positions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: acquire index lock in query_similar_documents query_similar_documents() loaded the index and ran the FAISS retriever without holding the file lock. All write paths (update_llm_index, llm_index_add_or_update_document, llm_index_remove_document) hold FileLock(_index_lock_path()), so a concurrent rebuild calling shutil.rmtree(LLM_INDEX_DIR) while a read is mid-load produces an IOError or corrupt partial state. Wrap the load_or_build_index() call and all subsequent retriever work inside FileLock. The early-return guards (vector_store_file_exists check, empty allowed_document_ids) remain outside the lock; the DB query for the final result set also stays outside. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: skip LLM index enqueue on document_updated during version addition When a document is consumed as a new version of an existing document, the consumer fires document_consumption_finished (which triggers add_or_update_document_in_llm_index) and then document_updated for the root document. Both signals are connected to the same handler, so the root document was enqueued for LLM indexing twice per version-addition event. Pass skip_ai_index=True on the consumer's version-addition document_updated send so the handler's existing guard suppresses the duplicate enqueue. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Test: bulk_update_documents must not enqueue per-doc LLM tasks With AI enabled, bulk_update_documents() sends document_updated for every document in the batch. The skip_ai_index=True kwarg (added in the P2 fix) prevents add_or_update_document_in_llm_index from enqueuing a per-document task for each one. 
Only the single update_llm_index call at the end should run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Debug level log sure * Update src/paperless_ai/indexing.py Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com> * Apply suggestion from @shamoon --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-06-02 10:46:29 -07:00
GitHub Actions	9c1649f1ac	Auto translate strings	2026-06-02 15:34:49 +00:00
shamoon	ab8fe0521b	Merge branch 'beta' into dev	2026-06-02 08:32:54 -07:00
shamoon	2638554969	Merge branch 'main' into dev	2026-06-02 08:32:43 -07:00
Trenton H	2c58d86380	Fix: Minor fixes for the AI indexing (#12893 ) * Fix: Remove all nodes for multi-chunk documents in update_llm_index incremental path The existing_nodes dict comprehension keyed on document_id silently dropped all but the last node per document, so only that one node was deleted when a modified document was re-indexed, leaving all other chunks as ghost vectors in the FAISS index. Switch to a defaultdict(list) that collects every node per document_id, then iterate and delete all of them before inserting fresh nodes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Wire document_updated signal to LLM index update handler Connect document_updated to add_or_update_document_in_llm_index in DocumentsConfig.ready() so REST API edits (PATCH /api/documents/{id}/) enqueue an LLM vector store update, matching the existing document_consumption_finished behavior. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Add file lock around FAISS index mutations to prevent concurrent write corruption Two concurrent Celery workers calling llm_index_add_or_update_document or llm_index_remove_document each loaded the same on-disk index independently, made their own change, and the last writer silently overwrote the first's update. Wrap both functions and the rebuild/persist body of update_llm_index in a filelock.FileLock keyed on LLM_INDEX_DIR/index.lock. Add a TOCTOU comment on queue_llm_index_update_if_needed explaining the residual risk (duplicate rebuild tasks are wasteful but not corrupting because the lock serialises the actual write). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Apply _normalize() in extract_unmatched_names to prevent duplicate suggestions extract_unmatched_names was using .lower() while _match_names_to_queryset uses _normalize() (which also strips punctuation). A name like "J. 
Smith" matched to existing correspondent "J Smith" would still appear in the unmatched list, causing duplicate object creation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Skip LLM index update gracefully when document has no indexable content Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Persist empty index when all documents are deleted to clear stale FAISS vectors The early-return guard in update_llm_index fired before persist() when no documents existed, leaving a stale on-disk FAISS index that returned phantom hits for deleted document IDs. Now the guard only returns early for the incremental (rebuild=False) path when no index exists on disk; the rebuild path always continues through to persist(), producing an empty clean index. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Chore: Simplify incremental index update — use docs.values() and deduplicate node extend --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 13:40:49 -07:00
shamoon	52222d23d3	Fix (beta): don't use tool calling with ollama (#12896 )	2026-06-01 12:12:23 -07:00
shamoon	27426c04b0	Enhancement: try to respect language for AI suggestions (#12894 )	2026-06-01 12:11:46 -07:00
shamoon	f6c865bf47	Enhancement: AI LLM chunk size and context window config (#12891 )	2026-06-01 17:56:21 +00:00
Trenton H	bb860a5834	Fix: Improvements for security around the AI (#12895 ) * Fix: Validate and limit chat question input in ChatStreamingView Add max_length=4000 to ChatStreamingSerializer.q and replace the bare request.data["q"] read with proper serializer.is_valid(raise_exception=True) so oversized or missing questions are rejected with HTTP 400 before reaching the LLM. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> * Fix: Add defensive prompt framing to mark document content as untrusted * Also adds a system prompt, which is treated with higher priority, stating that document content is untrusted --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-06-01 10:03:27 -07:00
Trenton H	889ccfd67a	Fix: Fold query and autocomplete terms with Tantivy's ascii_fold so special letters match (#12868 )	2026-05-29 16:42:07 -07:00
Trenton H	bbceb5dac6	Fix: Don't store autocomplete_word, only index it (#12867 )	2026-05-29 14:09:04 -07:00
Trenton H	98a7ed32e3	Fix: Preserve Whoosh date range swapping in Tantivy (#12866 )	2026-05-29 20:21:59 +00:00
Trenton H	25a7b2038a	Fix: Always release search index writer, even on failure, so the write lock doesn't persist for later (#12865 )	2026-05-29 19:38:58 +00:00
Trenton H	97e3c75720	Fix: Handle CJK title, content and metadata searching (#12862 )	2026-05-29 19:11:55 +00:00
Trenton H	11c62757ef	Fix: Restrict date query rewrites to date or datetime fields only (#12864 )	2026-05-29 11:59:30 -07:00
Trenton H	4a8d79be6f	Fix: Missing call to Tantivy wait_merging_threads (#12863 )	2026-05-29 10:32:15 -07:00
Trenton H	525b986e23	Fix: Handle Tantivy index lock contention (#12856 ) Implements and tests a retry with backoff + jitter for acquiring the index update lock. If we still can't get it, dispatch a celery task to handle it later instead (also with retry) Signed-off-by: stumpylog <797416+stumpylog@users.noreply.github.com>	2026-05-27 09:47:13 -07:00
shamoon	4ce5f2022c	Fix (beta): better catch chat errors (#12854 )	2026-05-26 19:05:47 +00:00
shamoon	ab47185712	Performance (beta): don't re-build vector index with each chat (#12847 )	2026-05-26 11:36:05 -07:00
shamoon	01d8fad622	Security: fixes for v3 beta (#12838 )	2026-05-26 16:46:23 +00:00
shamoon	da3e845b8b	Fix (beta): normalize long punctuation chunks to improve embedding (#12848 )	2026-05-26 09:32:38 -07:00
Matt Van Horn	45ba35dd3a	docs: remove duplicate words in three files (#12852 )	2026-05-26 06:40:30 -07:00
shamoon	0a6e0db186	Fix: use chord.on_error before apply_async (#12842 )	2026-05-24 14:42:11 -07:00
shamoon	15682231b2	Chore: fix SonarQube logger warnings	2026-05-20 08:54:00 -07:00
Trenton H	df861189fa	Fix: Don't use smaller integer fields for some workflow fields (#12834 )	2026-05-20 14:39:01 +00:00
Trenton H	bd86dca57e	Fix: Password removal source file location (#12830 ) Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>	2026-05-19 13:52:04 -07:00
Trenton H	ff3360310b	Fix: Defer password removal workflow action until the file is in place (#12814 )	2026-05-16 17:14:37 -07:00
Trenton H	9a68dcdddf	Fix: Allow setting allauth rate limit configuration settings (#12798 )	2026-05-14 07:29:49 -07:00
Trenton H	9a78882b5a	Fix: Don't embed the metadata which is already embedded into the context (#12795 )	2026-05-13 09:01:34 -07:00
Trenton H	7e381f204e	Fix: Sanitize dash or plus from the text search path (#12789 )	2026-05-12 12:41:38 -07:00
shamoon	7471fedb43	Fix: Update parser contract to require empty strings, not None (#12775 ) Co-authored-by: stumpylog <797416+stumpylog@users.noreply.github.com>	2026-05-11 09:16:21 -07:00

4072 Commits