mirror of
https://github.com/paperless-ngx/paperless-ngx.git
synced 2026-04-20 14:59:27 +00:00
For the common case (num_tags > 1), switch from MLPClassifier to OneVsRestClassifier(LinearSVC()) for the tags classifier.

MLPClassifier with thousands of output neurons (e.g. 3,085 AUTO tags) requires a dense num_docs x num_tags label matrix and runs full gradient descent with Adam optimiser for up to 200 epochs -- the primary cause of >10 GB RAM and multi-hour training in extreme cases.

LinearSVC trains one binary linear SVM per class via OneVsRestClassifier. Each model is a single weight vector; training is parallelisable and orders of magnitude faster for large class counts.

The num_tags == 1 binary path is unchanged (MLP is kept there because LinearSVC requires at least 2 distinct classes in training data, which is not guaranteed when all documents share the single AUTO tag).

Adds test_classifier_tags_correctness.py, which verifies:
- Multi-cluster docs are predicted correctly (single and multi-tag)
- Single-tag (binary) path is predicted correctly
- Test passes with MLP (baseline) and LinearSVC (after swap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
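
For readers unfamiliar with the two estimators, the branch described in this message can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `build_tags_classifier`, the toy feature matrix, and the use of default hyperparameters are all assumptions made for the example.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import LinearSVC


def build_tags_classifier(num_tags: int):
    """Hypothetical helper: choose the tags estimator from the AUTO tag count."""
    if num_tags > 1:
        # One independent binary linear SVM per tag column.
        # OneVsRestClassifier also accepts n_jobs to fit the columns in parallel.
        return OneVsRestClassifier(LinearSVC())
    # Single-tag path: keep the MLP (default settings assumed here).
    return MLPClassifier()


# Toy stand-in for vectorised documents: 8 docs, 3 features (assumed data).
X = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.1, 0.9, 0.0],
    [1.0, 1.0, 0.0],
    [0.9, 0.9, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.1, 0.9],
])

# Multi-label indicator matrix, shape (num_docs, num_tags).
# Rows may carry zero, one, or several tags.
Y = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
    [0, 1],
    [1, 1],
    [1, 1],
    [0, 0],
    [0, 0],
])

clf = build_tags_classifier(num_tags=Y.shape[1])
clf.fit(X, Y)

# predict() returns an indicator matrix with the same layout as Y.
pred = clf.predict(X)
print(pred.shape)
# After fitting there is one binary estimator per tag column.
print(len(clf.estimators_))
```

Each fitted entry in `clf.estimators_` holds a single weight vector (`coef_`) plus an intercept, which is why the per-tag cost stays small compared with one network that has an output neuron per tag.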