paperless-ngx/src
Trenton H 68b866aeee perf: fast skip in classifier train() via auto-label-set digest
Add a fast-skip gate at the top of DocumentClassifier.train() that
returns False after at most 5 DB queries (1x MAX(modified) on
non-inbox docs + 4x MATCH_AUTO pk lists), avoiding the O(N)
per-document label scan on no-op calls.
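
The gate described above can be sketched roughly as follows. This is a minimal illustration, not the actual paperless-ngx code: `TrainState`, `can_skip_training`, and the `last_doc_change` field are hypothetical names, and the two "current" values are assumed to have been fetched already by the cheap queries (MAX(modified) and the MATCH_AUTO pk lists).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TrainState:
    """Hypothetical snapshot persisted after the last full train."""

    last_doc_change: Optional[datetime]
    last_auto_label_set_digest: Optional[str]


def can_skip_training(
    state: TrainState,
    latest_doc_change: Optional[datetime],
    current_digest: str,
) -> bool:
    """Return True when nothing relevant changed since the last full train."""
    # No stored state (first run or an older pickle): a full train is required.
    if state.last_doc_change is None or state.last_auto_label_set_digest is None:
        return False
    # Assumption: with no documents to compare, defer to the full code path.
    if latest_doc_change is None:
        return False
    docs_unchanged = latest_doc_change <= state.last_doc_change
    labels_unchanged = current_digest == state.last_auto_label_set_digest
    return docs_unchanged and labels_unchanged
```

In this sketch, train() would return False immediately when `can_skip_training(...)` is True, and fall through to the per-document scan otherwise.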

Previously the classifier always iterated over every document to build
the label hash before it could decide to skip, which took ~8 s at 5k
docs and scaled linearly with document count.

Changes:
- FORMAT_VERSION 10 -> 11 (new field in pickle)
- New field `last_auto_label_set_digest` stored after each full train
- New static method `_compute_auto_label_set_digest()` (4 queries)
- Fast-skip block before the document queryset; mirrors the inbox-tag
  exclusion used by the training queryset for an apples-to-apples
  MAX(modified) comparison
- Remove old embedded skip check (after the full label scan) which had
  a correctness gap: MATCH_AUTO labels with no document assignments
  were invisible to the per-doc hash, so a new unassigned AUTO label
  would not trigger a retrain
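
As a rough illustration of the digest idea, the sketch below hashes the sorted primary keys of MATCH_AUTO labels per label kind. It is an assumption-laden stand-in for `_compute_auto_label_set_digest()`: the function name, the kind names, the choice of SHA-256, and passing the pk lists in as plain Python data (rather than running the 4 queries) are all hypothetical.

```python
import hashlib
from typing import Iterable, Mapping


def compute_auto_label_set_digest(
    auto_pks_by_kind: Mapping[str, Iterable[int]],
) -> str:
    """Digest the set of MATCH_AUTO label pks, independent of input order."""
    hasher = hashlib.sha256()  # assumption: the real hash function may differ
    for kind in sorted(auto_pks_by_kind):
        hasher.update(kind.encode("utf-8"))
        hasher.update(b":")
        for pk in sorted(auto_pks_by_kind[kind]):
            # The separator keeps pks [1, 2] distinct from [12].
            hasher.update(str(pk).encode("ascii"))
            hasher.update(b",")
        hasher.update(b";")
    return hasher.hexdigest()


# Hypothetical pk lists; kind names are illustrative only.
before = compute_auto_label_set_digest({"tag": [3, 1], "correspondent": [2]})
after = compute_auto_label_set_digest({"tag": [3, 1, 7], "correspondent": [2]})
```

Because the digest is built from the labels themselves rather than from document assignments, a new AUTO label with no documents (pk 7 above) changes the digest, which is the case the old per-document hash could not see.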

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:30:16 -07:00