paperless-ngx/src
Trenton H 1a26514a96 perf: replace MLPClassifier with LinearSVC for multi-tag classification
For the common case (num_tags > 1), switch from MLPClassifier to
OneVsRestClassifier(LinearSVC()) for the tags classifier.

MLPClassifier with thousands of output neurons (e.g. 3,085 AUTO tags)
requires a dense num_docs x num_tags label matrix and runs minibatch
gradient descent with the Adam optimiser for up to 200 epochs. This is
the primary cause of >10 GB RAM use and multi-hour training in extreme cases.
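
To give a rough sense of scale, the size of the dense label matrix alone can be estimated as below. This is a back-of-the-envelope sketch: the document count and the 8 bytes per cell (an int64 indicator matrix) are assumptions for illustration, not figures from this commit. Only the tag count is taken from the message above.

```python
def dense_label_matrix_bytes(num_docs: int, num_tags: int, bytes_per_cell: int = 8) -> int:
    """Size in bytes of a dense num_docs x num_tags label matrix."""
    return num_docs * num_tags * bytes_per_cell


num_tags = 3085      # AUTO tag count quoted in the commit message
num_docs = 100_000   # hypothetical library size, chosen for illustration

size_bytes = dense_label_matrix_bytes(num_docs, num_tags)
print(f"{size_bytes / 1e9:.2f} GB")  # prints "2.47 GB"
```

The label matrix is only one contributor; the estimate does not cover the feature matrix or the optimiser state.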

OneVsRestClassifier trains one binary LinearSVC per tag. Each model is
a single weight vector and the per-tag fits are independent, so training
is parallelisable and orders of magnitude faster for large tag counts.
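
A minimal sketch of the one-vs-rest setup described above, using scikit-learn directly. The toy corpus, the tag ids and the variable names are hypothetical and are not taken from the paperless-ngx classifier code; n_jobs is shown only to indicate where parallelism could be enabled.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; tag ids 1 and 2 are made up for illustration.
docs = [
    "invoice payment due",
    "invoice amount payment",
    "insurance policy renewal",
    "insurance claim policy",
    "invoice payment insurance policy",
    "meeting notes agenda",
]
doc_tags = [[1], [1], [2], [2], [1, 2], []]

# Sparse bag-of-words features and a num_docs x num_tags indicator matrix.
X = CountVectorizer().fit_transform(docs)
Y = MultiLabelBinarizer(classes=[1, 2]).fit_transform(doc_tags)

# One independent binary LinearSVC is fitted per tag column.
clf = OneVsRestClassifier(LinearSVC(), n_jobs=1)
clf.fit(X, Y)

pred = clf.predict(X)  # indicator matrix, one column per tag
print(len(clf.estimators_), clf.estimators_[0].coef_.shape)
```

Each entry of clf.estimators_ holds one weight vector of length n_features, which is why the model stays small even with thousands of tags.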

The num_tags == 1 binary path is unchanged (MLP is kept there because
LinearSVC requires at least 2 distinct classes in training data, which
is not guaranteed when all documents share the single AUTO tag).
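
The single-class limitation can be reproduced in isolation. The sketch below uses made-up feature vectors and assumes the situation described above, where every training document carries the one AUTO tag, so the label vector contains a single class.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Hypothetical features for three documents.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
# Every document has the single AUTO tag, so only one class is present.
y = np.array([1, 1, 1])

try:
    LinearSVC().fit(X, y)
    single_class_error = None
except ValueError as exc:
    # The underlying solver refuses to train on fewer than 2 classes.
    single_class_error = str(exc)

print(single_class_error)
```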

Adds test_classifier_tags_correctness.py, which verifies:
- Multi-cluster docs are predicted correctly (single and multi-tag)
- Single-tag (binary) path is predicted correctly
- Test passes with MLP (baseline) and LinearSVC (after swap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 10:18:16 -07:00