mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-04-10 01:58:53 +00:00
Add a fast-skip gate at the top of DocumentClassifier.train() that returns False after at most 5 DB queries (1x MAX(modified) on non-inbox docs + 4x MATCH_AUTO pk lists), avoiding the O(N) per-document label scan on no-op calls.

Previously the classifier always iterated every document to build the label hash before it could decide to skip: ~8 s at 5k docs, scaling linearly.

Changes:
- FORMAT_VERSION 10 -> 11 (new field in pickle)
- New field `last_auto_label_set_digest` stored after each full train
- New static method `_compute_auto_label_set_digest()` (4 queries)
- Fast-skip block before the document queryset; mirrors the inbox-tag exclusion used by the training queryset for an apples-to-apples MAX(modified) comparison
- Remove old embedded skip check (after the full label scan), which had a correctness gap: MATCH_AUTO labels with no document assignments were invisible to the per-doc hash, so a new unassigned AUTO label would not trigger a retrain

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
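
The skip decision described above can be sketched roughly as follows. This is a minimal, database-free illustration and not the code from this commit: `compute_auto_label_set_digest` and `can_skip_training` are hypothetical names, the SHA-256 hash and the length-prefixed encoding are assumptions, and the pk lists and timestamps are passed in as plain values where the real method would obtain them from the 5 queries.

```python
import hashlib
from datetime import datetime
from typing import Iterable, Optional, Sequence


def compute_auto_label_set_digest(pk_lists: Sequence[Iterable[int]]) -> bytes:
    """Hash the primary keys of all auto-matching labels.

    `pk_lists` holds one iterable of pks per label type (hypothetical input;
    the commit fetches these with 4 queries).
    """
    hasher = hashlib.sha256()
    for pks in pk_lists:
        ordered = sorted(pks)  # make the digest independent of query order
        # Length prefix keeps ([1], [2]) distinct from ([1, 2], []).
        hasher.update(len(ordered).to_bytes(8, "little"))
        for pk in ordered:
            hasher.update(pk.to_bytes(8, "little", signed=True))
    return hasher.digest()


def can_skip_training(
    last_trained: Optional[datetime],
    last_digest: Optional[bytes],
    latest_modified: Optional[datetime],
    current_digest: bytes,
) -> bool:
    """Return True when a retrain would be a no-op under these assumptions."""
    # No stored state from a previous full train: nothing to compare against.
    if last_trained is None or last_digest is None:
        return False
    # Assumption: with no eligible documents, defer to the full training path.
    if latest_modified is None or latest_modified > last_trained:
        return False
    # An AUTO label added or removed changes the digest even when the label
    # has no document assignments yet.
    return current_digest == last_digest
```

Here a True result corresponds to the case where train() returns False early. Because the digest covers the label set itself and not per-document assignments, a newly created AUTO label with no documents still forces a retrain, which is the gap the removed check had.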