Compare commits

..

6 Commits

Author SHA1 Message Date
Trenton H
1a26514a96 perf: replace MLPClassifier with LinearSVC for multi-tag classification
For the common case (num_tags > 1), switch from MLPClassifier to
OneVsRestClassifier(LinearSVC()) for the tags classifier.

MLPClassifier with thousands of output neurons (e.g. 3,085 AUTO tags)
requires a dense num_docs x num_tags label matrix and runs full
gradient descent with the Adam optimiser for up to 200 epochs -- the
primary cause of >10 GB RAM and multi-hour training in extreme cases.

LinearSVC trains one binary linear SVM per class via OneVsRestClassifier.
Each model is a single weight vector; training is parallelisable and
orders of magnitude faster for large class counts.

The num_tags == 1 binary path is unchanged (MLP is kept there because
LinearSVC requires at least 2 distinct classes in training data, which
is not guaranteed when all documents share the single AUTO tag).
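The swap described above can be sketched with toy data. This is a hypothetical illustration, not code from the Paperless codebase; the names `X_texts` and `labels` are invented, and the real classifier vectorises document content rather than these short strings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

# Toy corpus: each document may carry several tags (the num_tags > 1 case).
X_texts = [
    "invoice from acme corp for march",
    "tax statement for fiscal year",
    "invoice and receipt from acme corp",
]
labels = [["invoice"], ["tax"], ["invoice", "receipt"]]

# Binarize the tag lists into a 0/1 label matrix, one column per tag.
binarizer = MultiLabelBinarizer()
y = binarizer.fit_transform(labels)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(X_texts)

# One binary linear SVM per tag; each fitted model is a single weight
# vector, so training cost grows gently with the number of tags.
clf = OneVsRestClassifier(LinearSVC())
clf.fit(X, y)

pred = clf.predict(vectorizer.transform(["acme corp invoice"]))
print(binarizer.inverse_transform(pred))
```

Note that each per-tag binary fit still needs both a positive and a negative example of that tag in the training data, which is why the single-tag path keeps MLP as described above.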

Adds test_classifier_tags_correctness.py, which verifies:
- Multi-cluster docs are predicted correctly (single and multi-tag)
- Single-tag (binary) path is predicted correctly
- Test passes with MLP (baseline) and LinearSVC (after swap)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 10:18:16 -07:00
Trenton H
1fefd506b7 perf: eliminate second document queryset scan in classifier train()
Capture doc.content during the label extraction loop so the document
queryset is iterated exactly once per training run.

Previously CountVectorizer.fit_transform() consumed a content_generator()
that re-evaluated the same docs_queryset, causing a second full table
scan. At 5k docs this wasted ~2.4 s and doubled DB I/O on every train.

Remove content_generator(); replace with a generator expression over
the in-memory doc_contents list collected during Step 1.
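The single-pass pattern this commit adopts can be sketched as follows. The queryset is simulated here with a plain list and a stand-in `Doc` class; the real code iterates a Django queryset, but the shape is the same: capture labels and content in one loop, then feed the vectorizer a generator expression over the in-memory list.

```python
from sklearn.feature_extraction.text import CountVectorizer

class Doc:
    """Stand-in for a Document row; only the fields the sketch needs."""
    def __init__(self, content, tag):
        self.content = content
        self.tag = tag

# Stand-in for docs_queryset; in the real code this hits the database.
docs_queryset = [Doc("alpha beta", 1), Doc("beta gamma", 2)]

labels = []
doc_contents = []
for doc in docs_queryset:  # Step 1: the only pass over the documents
    labels.append(doc.tag)
    doc_contents.append(doc.content)

vectorizer = CountVectorizer()
# Generator expression over the in-memory list: no second queryset
# evaluation, so the table is scanned exactly once per training run.
data = vectorizer.fit_transform(content for content in doc_contents)
print(data.shape)
```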

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:38:53 -07:00
Trenton H
68b866aeee perf: fast skip in classifier train() via auto-label-set digest
Add a fast-skip gate at the top of DocumentClassifier.train() that
returns False after at most 5 DB queries (1x MAX(modified) on
non-inbox docs + 4x MATCH_AUTO pk lists), avoiding the O(N)
per-document label scan on no-op calls.

Previously the classifier always iterated every document to build the
label hash before it could decide to skip — ~8 s at 5k docs, scaling
linearly.

Changes:
- FORMAT_VERSION 10 -> 11 (new field in pickle)
- New field `last_auto_label_set_digest` stored after each full train
- New static method `_compute_auto_label_set_digest()` (4 queries)
- Fast-skip block before the document queryset; mirrors the inbox-tag
  exclusion used by the training queryset for an apples-to-apples
  MAX(modified) comparison
- Remove old embedded skip check (after the full label scan) which had
  a correctness gap: MATCH_AUTO labels with no document assignments
  were invisible to the per-doc hash, so a new unassigned AUTO label
  would not trigger a retrain
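The digest idea behind the fast-skip gate can be sketched in isolation. This is a simplified stand-in: the four per-model `MATCH_AUTO` pk queries are replaced with plain lists, and `compute_digest` mirrors the hashing scheme (sorted pks, 4-byte little-endian) without any Django dependency.

```python
from hashlib import sha256

def compute_digest(pk_lists):
    """Hash the sorted MATCH_AUTO pks of each label model in turn.

    pk_lists stands in for the four cheap indexed queries (one per
    label type). The digest depends only on which labels are AUTO,
    not on how many documents are assigned to them.
    """
    hasher = sha256()
    for pks in pk_lists:
        for pk in sorted(pks):
            hasher.update(pk.to_bytes(4, "little", signed=False))
    return hasher.digest()

# Digest stored in the pickle after the last full train.
stored = compute_digest([[3, 1], [7], [], [42]])

# Same label set, different query order -> identical digest -> fast skip
# (together with an unchanged MAX(modified), train() can return early).
assert compute_digest([[1, 3], [7], [], [42]]) == stored

# A new AUTO label with zero document assignments still changes the
# digest -> retrain is triggered, closing the gap the old per-document
# hash had for unassigned labels.
assert compute_digest([[1, 3], [7], [99], [42]]) != stored
```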

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-08 09:30:16 -07:00
Trenton H
a5fe88d2a1 Chore: Resolves some zizmor reported code scan findings (#12516)
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-04-06 23:03:29 +00:00
GitHub Actions
51c59746a7 Auto translate strings 2026-04-06 22:51:57 +00:00
Trenton H
c232d443fa Breaking: Decouple OCR control from archive file control (#12448)
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
2026-04-06 15:50:21 -07:00
58 changed files with 2935 additions and 443 deletions

View File

@@ -13,10 +13,13 @@ concurrency:
env:
DEFAULT_UV_VERSION: "0.10.x"
NLTK_DATA: "/usr/share/nltk_data"
permissions: {}
jobs:
changes:
name: Detect Backend Changes
runs-on: ubuntu-slim
permissions:
contents: read
outputs:
backend_changed: ${{ steps.force.outputs.run_all == 'true' || steps.filter.outputs.backend == 'true' }}
steps:
@@ -66,6 +69,8 @@ jobs:
if: needs.changes.outputs.backend_changed == 'true'
name: "Python ${{ matrix.python-version }}"
runs-on: ubuntu-24.04
permissions:
contents: read
strategy:
matrix:
python-version: ['3.11', '3.12', '3.13', '3.14']
@@ -143,6 +148,8 @@ jobs:
if: needs.changes.outputs.backend_changed == 'true'
name: Check project typing
runs-on: ubuntu-24.04
permissions:
contents: read
env:
DEFAULT_PYTHON: "3.12"
steps:

View File

@@ -89,7 +89,7 @@ jobs:
push_external="true"
;;
esac
case "${{ github.ref }}" in
case "${GITHUB_REF}" in
refs/tags/v*|*beta.rc*)
push_external="true"
;;
@@ -230,8 +230,10 @@ jobs:
docker buildx imagetools create ${tags} ${digests}
- name: Inspect image
env:
FIRST_TAG: ${{ fromJSON(steps.docker-meta.outputs.json).tags[0] }}
run: |
docker buildx imagetools inspect ${{ fromJSON(steps.docker-meta.outputs.json).tags[0] }}
docker buildx imagetools inspect "${FIRST_TAG}"
- name: Copy to Docker Hub
if: needs.build-arch.outputs.push-external == 'true'
env:

View File

@@ -10,8 +10,6 @@ concurrency:
cancel-in-progress: true
permissions:
contents: read
pages: write
id-token: write
env:
DEFAULT_UV_VERSION: "0.10.x"
DEFAULT_PYTHON_VERSION: "3.12"
@@ -105,6 +103,9 @@ jobs:
needs: [changes, build]
if: github.event_name == 'push' && github.ref == 'refs/heads/main' && needs.changes.outputs.docs_changed == 'true'
runs-on: ubuntu-24.04
permissions:
pages: write
id-token: write
environment:
name: github-pages
url: ${{ steps.deployment.outputs.page_url }}

View File

@@ -10,10 +10,13 @@ on:
concurrency:
group: frontend-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
permissions: {}
jobs:
changes:
name: Detect Frontend Changes
runs-on: ubuntu-slim
permissions:
contents: read
outputs:
frontend_changed: ${{ steps.force.outputs.run_all == 'true' || steps.filter.outputs.frontend == 'true' }}
steps:
@@ -21,6 +24,7 @@ jobs:
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
persist-credentials: false
- name: Decide run mode
id: force
run: |
@@ -59,6 +63,8 @@ jobs:
if: needs.changes.outputs.frontend_changed == 'true'
name: Install Dependencies
runs-on: ubuntu-24.04
permissions:
contents: read
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -89,6 +95,8 @@ jobs:
needs: [changes, install-dependencies]
if: needs.changes.outputs.frontend_changed == 'true'
runs-on: ubuntu-24.04
permissions:
contents: read
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -120,6 +128,8 @@ jobs:
needs: [changes, install-dependencies]
if: needs.changes.outputs.frontend_changed == 'true'
runs-on: ubuntu-24.04
permissions:
contents: read
strategy:
fail-fast: false
matrix:
@@ -169,6 +179,8 @@ jobs:
needs: [changes, install-dependencies]
if: needs.changes.outputs.frontend_changed == 'true'
runs-on: ubuntu-24.04
permissions:
contents: read
container: mcr.microsoft.com/playwright:v1.58.2-noble
env:
PLAYWRIGHT_BROWSERS_PATH: /ms-playwright
@@ -212,6 +224,8 @@ jobs:
needs: [changes, unit-tests, e2e-tests]
if: needs.changes.outputs.frontend_changed == 'true'
runs-on: ubuntu-24.04
permissions:
contents: read
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2

View File

@@ -9,6 +9,8 @@ on:
concurrency:
group: lint-${{ github.event.pull_request.number || github.ref }}
cancel-in-progress: true
permissions:
contents: read
jobs:
lint:
name: Linting via prek

View File

@@ -10,10 +10,14 @@ concurrency:
env:
DEFAULT_UV_VERSION: "0.10.x"
DEFAULT_PYTHON_VERSION: "3.12"
permissions: {}
jobs:
wait-for-docker:
name: Wait for Docker Build
runs-on: ubuntu-24.04
permissions:
checks: read
statuses: read
steps:
- name: Wait for Docker build
uses: lewagon/wait-on-check-action@74049309dfeff245fe8009a0137eacf28136cb3c # v1.5.0
@@ -26,6 +30,8 @@ jobs:
name: Build Release
needs: wait-for-docker
runs-on: ubuntu-24.04
permissions:
contents: read
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -40,8 +46,7 @@ jobs:
uses: actions/setup-node@53b83947a5a98c8d113130e565377fae1a50d02f # v6.3.0
with:
node-version: 24.x
cache: 'pnpm'
cache-dependency-path: 'src-ui/pnpm-lock.yaml'
package-manager-cache: false
- name: Install frontend dependencies
run: cd src-ui && pnpm install
- name: Build frontend
@@ -56,7 +61,7 @@ jobs:
uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 # v7.3.1
with:
version: ${{ env.DEFAULT_UV_VERSION }}
enable-cache: true
enable-cache: false
python-version: ${{ steps.setup-python.outputs.python-version }}
- name: Install Python dependencies
run: |
@@ -129,6 +134,9 @@ jobs:
name: Publish Release
needs: build-release
runs-on: ubuntu-24.04
permissions:
contents: write
pull-requests: write
outputs:
prerelease: ${{ steps.get-version.outputs.prerelease }}
changelog: ${{ steps.create-release.outputs.body }}
@@ -141,9 +149,11 @@ jobs:
path: ./
- name: Get version info
id: get-version
env:
REF_NAME: ${{ github.ref_name }}
run: |
echo "version=${{ github.ref_name }}" >> $GITHUB_OUTPUT
if [[ "${{ github.ref_name }}" == *"-beta.rc"* ]]; then
echo "version=${REF_NAME}" >> $GITHUB_OUTPUT
if [[ "${REF_NAME}" == *"-beta.rc"* ]]; then
echo "prerelease=true" >> $GITHUB_OUTPUT
else
echo "prerelease=false" >> $GITHUB_OUTPUT
@@ -176,6 +186,9 @@ jobs:
needs: publish-release
if: needs.publish-release.outputs.prerelease == 'false'
runs-on: ubuntu-24.04
permissions:
contents: write
pull-requests: write
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
@@ -191,15 +204,17 @@ jobs:
uses: astral-sh/setup-uv@5a095e7a2014a4212f075830d4f7277575a9d098 # v7.3.1
with:
version: ${{ env.DEFAULT_UV_VERSION }}
enable-cache: true
enable-cache: false
python-version: ${{ env.DEFAULT_PYTHON_VERSION }}
- name: Update changelog
working-directory: docs
env:
CHANGELOG: ${{ needs.publish-release.outputs.changelog }}
run: |
git branch ${{ needs.publish-release.outputs.version }}-changelog
git checkout ${{ needs.publish-release.outputs.version }}-changelog
echo -e "# Changelog\n\n${{ needs.publish-release.outputs.changelog }}\n" > changelog-new.md
printf '# Changelog\n\n%s\n' "${CHANGELOG}" > changelog-new.md
echo "Manually linking usernames"
sed -i -r 's|@([a-zA-Z0-9_]+) \(\[#|[@\1](https://github.com/\1) ([#|g' changelog-new.md

View File

@@ -33,10 +33,18 @@ jobs:
container:
image: semgrep/semgrep:1.155.0@sha256:cc869c685dcc0fe497c86258da9f205397d8108e56d21a86082ea4886e52784d
if: github.actor != 'dependabot[bot]'
permissions:
contents: read
security-events: write
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
- name: Run Semgrep
run: semgrep scan --config auto
run: semgrep scan --config auto --sarif-output results.sarif
- name: Upload results to GitHub code scanning
uses: github/codeql-action/upload-sarif@c10b8064de6f491fea524254123dbe5e09572f13 # v4.35.1
if: always()
with:
sarif_file: results.sarif

View File

@@ -12,6 +12,7 @@ on:
concurrency:
group: registry-tags-cleanup
cancel-in-progress: false
permissions: {}
jobs:
cleanup-images:
name: Cleanup Image Tags for ${{ matrix.primary-name }}

View File

@@ -6,6 +6,9 @@ on:
push:
paths: ['src/locale/**', 'src-ui/messages.xlf', 'src-ui/src/locale/**']
branches: [dev]
permissions:
contents: write
pull-requests: write
jobs:
synchronize-with-crowdin:
name: Crowdin Sync

View File

@@ -3,10 +3,6 @@ on:
schedule:
- cron: '0 3 * * *'
workflow_dispatch:
permissions:
issues: write
pull-requests: write
discussions: write
concurrency:
group: lock
jobs:
@@ -14,6 +10,9 @@ jobs:
name: 'Stale'
if: github.repository_owner == 'paperless-ngx'
runs-on: ubuntu-24.04
permissions:
issues: write
pull-requests: write
steps:
- uses: actions/stale@b5d41d4e1d5dceea10e7104786b73624c18a190f # v10.2.0
with:
@@ -36,6 +35,10 @@ jobs:
name: 'Lock Old Threads'
if: github.repository_owner == 'paperless-ngx'
runs-on: ubuntu-24.04
permissions:
issues: write
pull-requests: write
discussions: write
steps:
- uses: dessant/lock-threads@7266a7ce5c1df01b1c6db85bf8cd86c737dadbe7 # v6.0.0
with:
@@ -56,6 +59,8 @@ jobs:
name: 'Close Answered Discussions'
if: github.repository_owner == 'paperless-ngx'
runs-on: ubuntu-24.04
permissions:
discussions: write
steps:
- uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
@@ -113,6 +118,8 @@ jobs:
name: 'Close Outdated Discussions'
if: github.repository_owner == 'paperless-ngx'
runs-on: ubuntu-24.04
permissions:
discussions: write
steps:
- uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
@@ -205,6 +212,8 @@ jobs:
name: 'Close Unsupported Feature Requests'
if: github.repository_owner == 'paperless-ngx'
runs-on: ubuntu-24.04
permissions:
discussions: write
steps:
- uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:

61
.github/zizmor.yml vendored Normal file
View File

@@ -0,0 +1,61 @@
rules:
template-injection:
ignore:
# github.event_name is a GitHub-internal constant (push/pull_request/etc.),
# not attacker-controllable.
- ci-backend.yml:35
- ci-docker.yml:74
- ci-docs.yml:33
- ci-frontend.yml:32
# github.event.repository.default_branch refers to the target repo's setting,
# which only admins can change; not influenced by fork PR authors.
- ci-backend.yml:47
- ci-docs.yml:45
- ci-frontend.yml:44
# steps.setup-python.outputs.python-version is always a semver string (e.g. "3.12.0")
# produced by actions/setup-python from a hardcoded env var input.
- ci-backend.yml:106
- ci-backend.yml:121
- ci-backend.yml:169
- ci-docs.yml:88
- ci-docs.yml:92
- ci-release.yml:69
- ci-release.yml:78
- ci-release.yml:90
- ci-release.yml:96
- ci-release.yml:229
# needs.*.result is always one of: success/failure/cancelled/skipped.
- ci-backend.yml:211
- ci-backend.yml:212
- ci-backend.yml:216
- ci-docs.yml:131
- ci-docs.yml:132
- ci-frontend.yml:259
- ci-frontend.yml:260
- ci-frontend.yml:264
- ci-frontend.yml:269
- ci-frontend.yml:274
- ci-frontend.yml:279
# needs.changes.outputs.* is always "true" or "false".
- ci-backend.yml:206
- ci-docs.yml:126
- ci-frontend.yml:254
# steps.build.outputs.digest is always a SHA256 digest (sha256:[a-f0-9]{64}).
- ci-docker.yml:152
# needs.publish-release.outputs.version is the git tag name (e.g. v2.14.0);
# only maintainers can push tags upstream, and the tag pattern excludes
# shell metacharacters. Used in git commands and github-script JS, not eval.
- ci-release.yml:215
- ci-release.yml:216
- ci-release.yml:231
- ci-release.yml:237
- ci-release.yml:245
- ci-release.yml:248
dangerous-triggers:
ignore:
# Both workflows use pull_request_target solely to label/comment on fork PRs
# (requires write-back access unavailable to pull_request). Neither workflow
# checks out PR code or executes anything from the fork — only reads PR
# metadata via context/API. Permissions are scoped to pull-requests: write.
- pr-bot.yml:2
- project-actions.yml:2

1
.gitignore vendored
View File

@@ -111,3 +111,4 @@ celerybeat-schedule*
# ignore pnpm package store folder created when setting up the devcontainer
.pnpm-store/
.worktrees

View File

@@ -821,11 +821,14 @@ parsing documents.
#### [`PAPERLESS_OCR_MODE=<mode>`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}
: Tell paperless when and how to perform ocr on your documents. Three
: Tell paperless when and how to perform ocr on your documents. Four
modes are available:
- `skip`: Paperless skips all pages and will perform ocr only on
pages where no text is present. This is the safest option.
- `auto` (default): Paperless detects whether a document already
has embedded text via pdftotext. If sufficient text is found,
OCR is skipped for that document (`--skip-text`). If no text is
present, OCR runs normally. This is the safest option for mixed
document collections.
- `redo`: Paperless will OCR all pages of your documents and
attempt to replace any existing text layers with new text. This
@@ -843,24 +846,59 @@ modes are available:
significantly larger and text won't appear as sharp when zoomed
in.
The default is `skip`, which only performs OCR when necessary and
always creates archived documents.
- `off`: Paperless never invokes the OCR engine. For PDFs, text
is extracted via pdftotext only. For image documents, text will
be empty. Archive file generation still works via format
conversion (no Tesseract or Ghostscript required).
Read more about this in the [OCRmyPDF
The default is `auto`.
For the `skip`, `redo`, and `force` modes, read more about OCR
behaviour in the [OCRmyPDF
documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).
#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=<mode>`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=<mode>`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}
: Specify when you would like paperless to skip creating an archived
version of your documents. This is useful if you don't want to have two
almost-identical versions of your documents in the media folder.
: Controls when paperless creates a PDF/A archive version of your
documents. Archive files are stored alongside the original and are used
for display in the web interface.
- `never`: Never skip creating an archived version.
- `with_text`: Skip creating an archived version for documents
that already have embedded text.
- `always`: Always skip creating an archived version.
- `auto` (default): Produce archives for scanned or image-based
documents. Skip archive generation for born-digital PDFs that
already contain embedded text. This is the recommended setting
for mixed document collections.
- `always`: Always produce a PDF/A archive when the parser
supports it, regardless of whether the document already has
text.
- `never`: Never produce an archive. Only the original file is
stored. Saves disk space but the web viewer will display the
original file directly.
The default is `never`.
**Behaviour by file type and mode** (`auto` column shows the default):
| Document type | `never` | `auto` (default) | `always` |
| -------------------------- | ------- | -------------------------- | -------- |
| Scanned image (TIFF, JPEG) | No | **Yes** | Yes |
| Image-based PDF | No | **Yes** (short/no text, untagged) | Yes |
| Born-digital PDF | No | No (tagged or has embedded text) | Yes |
| Plain text, email, HTML | No | No | No |
| DOCX / ODT (via Tika) | Yes\* | Yes\* | Yes\* |
\* Tika always produces a PDF rendition for display; this counts as
the archive regardless of the setting.
!!! note
This setting applies to the built-in Tesseract parser. Parsers
that must always convert documents to PDF for display (e.g. DOCX,
ODT via Tika) will produce a PDF regardless of this setting.
!!! note
The **remote OCR parser** (Azure AI) always produces a searchable
PDF and stores it as the archive copy, regardless of this setting.
`ARCHIVE_FILE_GENERATION=never` has no effect when the remote
parser handles a document.
#### [`PAPERLESS_OCR_CLEAN=<mode>`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}

View File

@@ -123,7 +123,68 @@ Multiple options are combined in a single value:
PAPERLESS_DB_OPTIONS="sslmode=require;sslrootcert=/certs/ca.pem;pool.max_size=10"
```
## Search Index (Whoosh -> Tantivy)
## OCR and Archive File Generation Settings
The settings that control OCR behaviour and archive file generation have been redesigned. The old settings that coupled these two concerns together are **removed** — old values are not silently honoured; a startup warning is logged if any removed variable is still set in your environment.
### Removed settings
| Removed Setting | Replacement |
| ------------------------------------------- | --------------------------------------------------------------------- |
| `PAPERLESS_OCR_MODE=skip` | `PAPERLESS_OCR_MODE=auto` (new default) |
| `PAPERLESS_OCR_MODE=skip_noarchive` | `PAPERLESS_OCR_MODE=auto` + `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never` | `PAPERLESS_ARCHIVE_FILE_GENERATION=always` |
| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | `PAPERLESS_ARCHIVE_FILE_GENERATION=auto` (new default) |
| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always` | `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
### What changed and why
Previously, `OCR_MODE` conflated two independent concerns: whether to run OCR and whether to produce an archive. `skip` meant "skip OCR if text exists, but always produce an archive". `skip_noarchive` meant "skip OCR if text exists, and also skip the archive". This made it impossible to, for example, disable OCR entirely while still producing archives.
The new settings are independent:
- [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) controls OCR: `auto` (default), `force`, `redo`, `off`.
- [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) controls archive production: `auto` (default), `always`, `never`.
### Database configuration
If you changed OCR settings via the admin UI (ApplicationConfiguration), the database values are **migrated automatically** during the upgrade. `mode` values (`skip` / `skip_noarchive`) are mapped to their new equivalents and `skip_archive_file` values are converted to the new `archive_file_generation` field. After upgrading, review the OCR settings in the admin UI to confirm the migrated values match your intent.
### Action Required
Remove any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` variable from your environment. If you relied on `OCR_MODE=skip` or `OCR_MODE=skip_noarchive`, update accordingly:
```bash
# v2: skip OCR when text present, always archive
PAPERLESS_OCR_MODE=skip
# v3: equivalent (auto is the new default)
# No change needed — auto is the default
# v2: skip OCR when text present, skip archive too
PAPERLESS_OCR_MODE=skip_noarchive
# v3: equivalent
PAPERLESS_OCR_MODE=auto
PAPERLESS_ARCHIVE_FILE_GENERATION=never
# v2: always skip archive
PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always
# v3: equivalent
PAPERLESS_ARCHIVE_FILE_GENERATION=never
# v2: skip archive only for born-digital docs
PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text
# v3: equivalent (auto is the new default)
PAPERLESS_ARCHIVE_FILE_GENERATION=auto
```
### Remote OCR parser
If you use the **remote OCR parser** (Azure AI), note that it always produces a
searchable PDF and stores it as the archive copy. `ARCHIVE_FILE_GENERATION=never`
has no effect for documents handled by the remote parser — the archive is produced
unconditionally by the remote engine.
# Search Index (Whoosh -> Tantivy)
The full-text search backend has been replaced with [Tantivy](https://github.com/quickwit-oss/tantivy).
The index format is incompatible with Whoosh, so **the search index is automatically rebuilt from

View File

@@ -633,12 +633,11 @@ hardware, but a few settings can improve performance:
consumption, so you might want to lower these settings (example: 2
workers and 1 thread to always have some computing power left for
other tasks).
- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `skip` and consider
- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `auto` and consider
OCRing your documents before feeding them into Paperless. Some
scanners are able to do this!
- Set [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE`](configuration.md#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) to `with_text` to skip archive
file generation for already OCRed documents, or `always` to skip it
for all documents.
- Set [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) to `never` to skip archive
file generation entirely, saving disk space at the cost of in-browser PDF/A viewing.
- If you want to perform OCR on the device, consider using
`PAPERLESS_OCR_CLEAN=none`. This will speed up OCR times and use
less memory at the expense of slightly worse OCR results.

View File

@@ -134,9 +134,9 @@ following operations on your documents:
!!! tip
This process can be configured to fit your needs. If you don't want
paperless to create archived versions for digital documents, you can
configure that by configuring
`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text`. Please read the
paperless to create archived versions for born-digital documents, set
[`PAPERLESS_ARCHIVE_FILE_GENERATION=auto`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION)
(the default). To skip archives entirely, use `never`. Please read the
[relevant section in the documentation](configuration.md#ocr).
!!! note
@@ -457,7 +457,7 @@ fields and permissions, which will be merged.
#### Types {#workflow-trigger-types}
Currently, there are five events that correspond to workflow trigger 'types':
Currently, there are four events that correspond to workflow trigger 'types':
1. **Consumption Started**: _before_ a document is consumed, so events can include filters by source (mail, consumption
folder or API), file path, file name, mail rule
@@ -469,10 +469,8 @@ Currently, there are five events that correspond to workflow trigger 'types':
4. **Scheduled**: a scheduled trigger that can be used to run workflows at a specific time. The date used can be either the document
added, created, updated date or you can specify a (date) custom field. You can also specify a day offset from the date (positive
offsets will trigger after the date, negative offsets will trigger before).
5. **Version Added**: when a new version is added for an existing document. This trigger evaluates filters against the root document
and applies actions to the root document.
The following flow diagram illustrates the document trigger types:
The following flow diagram illustrates the four document trigger types:
```mermaid
flowchart TD
@@ -488,10 +486,6 @@ flowchart TD
'Updated'
trigger(s)"}
version{"Matching
'Version Added'
trigger(s)"}
scheduled{"Documents
matching
trigger(s)"}
@@ -508,15 +502,11 @@ flowchart TD
updated --> |Yes| J[Workflow Actions Run]
updated --> |No| K
J --> K[Document Saved]
L[New Document Version Added] --> version
version --> |Yes| V[Workflow Actions Run]
version --> |No| W
V --> W[Document Saved]
X[Scheduled Task Check<br/>hourly at :05] --> Y[Get All Scheduled Triggers]
Y --> scheduled
scheduled --> |Yes| Z[Workflow Actions Run]
scheduled --> |No| AA[Document Saved]
Z --> AA
L[Scheduled Task Check<br/>hourly at :05] --> M[Get All Scheduled Triggers]
M --> scheduled
scheduled --> |Yes| N[Workflow Actions Run]
scheduled --> |No| O[Document Saved]
N --> O
```
#### Filters {#workflow-trigger-filters}

View File

@@ -10456,8 +10456,8 @@
<context context-type="linenumber">111</context>
</context-group>
</trans-unit>
<trans-unit id="6114528299376689399" datatype="html">
<source>Skip Archive File</source>
<trans-unit id="8305051609904776938" datatype="html">
<source>Archive File Generation</source>
<context-group purpose="location">
<context context-type="sourcefile">src/app/data/paperless-config.ts</context>
<context context-type="linenumber">119</context>

View File

@@ -164,7 +164,7 @@
<pngx-input-text i18n-title title="Filter path" formControlName="filter_path" horizontal="true" i18n-hint hint="Apply to documents that match this path. Wildcards specified as * are allowed. Case-normalized." [error]="error?.filter_path"></pngx-input-text>
<pngx-input-select i18n-title title="Filter mail rule" [items]="mailRules" horizontal="true" [allowNull]="true" formControlName="filter_mailrule" i18n-hint hint="Apply to documents consumed via this mail rule." [error]="error?.filter_mailrule"></pngx-input-select>
}
@if (formGroup.get('type').value === WorkflowTriggerType.DocumentAdded || formGroup.get('type').value === WorkflowTriggerType.DocumentUpdated || formGroup.get('type').value === WorkflowTriggerType.Scheduled || formGroup.get('type').value === WorkflowTriggerType.VersionAdded) {
@if (formGroup.get('type').value === WorkflowTriggerType.DocumentAdded || formGroup.get('type').value === WorkflowTriggerType.DocumentUpdated || formGroup.get('type').value === WorkflowTriggerType.Scheduled) {
<pngx-input-select i18n-title title="Content matching algorithm" horizontal="true" [items]="getMatchingAlgorithms()" formControlName="matching_algorithm"></pngx-input-select>
@if (matchingPatternRequired(formGroup)) {
<pngx-input-text i18n-title title="Content matching pattern" horizontal="true" formControlName="match" [error]="error?.match"></pngx-input-text>
@@ -175,7 +175,7 @@
}
</div>
</div>
@if (formGroup.get('type').value === WorkflowTriggerType.DocumentAdded || formGroup.get('type').value === WorkflowTriggerType.DocumentUpdated || formGroup.get('type').value === WorkflowTriggerType.Scheduled || formGroup.get('type').value === WorkflowTriggerType.VersionAdded) {
@if (formGroup.get('type').value === WorkflowTriggerType.DocumentAdded || formGroup.get('type').value === WorkflowTriggerType.DocumentUpdated || formGroup.get('type').value === WorkflowTriggerType.Scheduled) {
<div class="row mt-3">
<div class="col">
<div class="trigger-filters mb-3">

View File

@@ -120,10 +120,6 @@ export const WORKFLOW_TYPE_OPTIONS = [
id: WorkflowTriggerType.Scheduled,
name: $localize`Scheduled`,
},
{
id: WorkflowTriggerType.VersionAdded,
name: $localize`Version Added`,
},
]
export const WORKFLOW_ACTION_OPTIONS = [

View File

@@ -11,16 +11,16 @@ export enum OutputTypeConfig {
}
export enum ModeConfig {
SKIP = 'skip',
REDO = 'redo',
AUTO = 'auto',
FORCE = 'force',
SKIP_NO_ARCHIVE = 'skip_noarchive',
REDO = 'redo',
OFF = 'off',
}
export enum ArchiveFileConfig {
NEVER = 'never',
WITH_TEXT = 'with_text',
AUTO = 'auto',
ALWAYS = 'always',
NEVER = 'never',
}
export enum CleanConfig {
@@ -115,11 +115,11 @@ export const PaperlessConfigOptions: ConfigOption[] = [
category: ConfigCategory.OCR,
},
{
key: 'skip_archive_file',
title: $localize`Skip Archive File`,
key: 'archive_file_generation',
title: $localize`Archive File Generation`,
type: ConfigOptionType.Select,
choices: mapToItems(ArchiveFileConfig),
config_key: 'PAPERLESS_OCR_SKIP_ARCHIVE_FILE',
config_key: 'PAPERLESS_ARCHIVE_FILE_GENERATION',
category: ConfigCategory.OCR,
},
{
@@ -337,7 +337,7 @@ export interface PaperlessConfig extends ObjectWithId {
pages: number
language: string
mode: ModeConfig
skip_archive_file: ArchiveFileConfig
archive_file_generation: ArchiveFileConfig
image_dpi: number
unpaper_clean: CleanConfig
deskew: boolean

View File

@@ -12,7 +12,6 @@ export enum WorkflowTriggerType {
DocumentAdded = 2,
DocumentUpdated = 3,
Scheduled = 4,
VersionAdded = 5,
}
export enum ScheduleDateField {

View File

@@ -10,13 +10,11 @@ class DocumentsConfig(AppConfig):
def ready(self) -> None:
from documents.signals import document_consumption_finished
from documents.signals import document_updated
from documents.signals import document_version_added
from documents.signals.handlers import add_inbox_tags
from documents.signals.handlers import add_or_update_document_in_llm_index
from documents.signals.handlers import add_to_index
from documents.signals.handlers import run_workflows_added
from documents.signals.handlers import run_workflows_updated
from documents.signals.handlers import run_workflows_version_added
from documents.signals.handlers import send_websocket_document_updated
from documents.signals.handlers import set_correspondent
from documents.signals.handlers import set_document_type
@@ -30,7 +28,6 @@ class DocumentsConfig(AppConfig):
document_consumption_finished.connect(set_storage_path)
document_consumption_finished.connect(add_to_index)
document_consumption_finished.connect(run_workflows_added)
document_version_added.connect(run_workflows_version_added)
document_consumption_finished.connect(add_or_update_document_in_llm_index)
document_updated.connect(run_workflows_updated)
document_updated.connect(send_websocket_document_updated)

View File

@@ -11,7 +11,6 @@ from typing import TYPE_CHECKING
if TYPE_CHECKING:
from collections.abc import Callable
from collections.abc import Iterator
from datetime import datetime
from numpy import ndarray
@@ -19,6 +18,7 @@ if TYPE_CHECKING:
from django.conf import settings
from django.core.cache import cache
from django.core.cache import caches
from django.db.models import Max
from documents.caching import CACHE_5_MINUTES
from documents.caching import CACHE_50_MINUTES
@@ -99,7 +99,8 @@ class DocumentClassifier:
# v8 - Added storage path classifier
# v9 - Changed from hashing to time/ids for re-train check
# v10 - HMAC-signed model file
FORMAT_VERSION = 10
# v11 - Added auto-label-set digest for fast skip without full document scan
FORMAT_VERSION = 11
HMAC_SIZE = 32 # SHA-256 digest length
@@ -108,6 +109,8 @@ class DocumentClassifier:
self.last_doc_change_time: datetime | None = None
# Hash of primary keys of AUTO matching values last used in training
self.last_auto_type_hash: bytes | None = None
# Digest of the set of all MATCH_AUTO label PKs (fast-skip guard)
self.last_auto_label_set_digest: bytes | None = None
self.data_vectorizer = None
self.data_vectorizer_hash = None
@@ -140,6 +143,29 @@ class DocumentClassifier:
sha256,
).digest()
@staticmethod
def _compute_auto_label_set_digest() -> bytes:
"""
Return a SHA-256 digest of all MATCH_AUTO label PKs across the four
label types. Four cheap indexed queries; stable for any fixed set of
AUTO labels regardless of document assignments.
"""
from documents.models import Correspondent
from documents.models import DocumentType
from documents.models import StoragePath
from documents.models import Tag
hasher = sha256()
for model in (Correspondent, DocumentType, Tag, StoragePath):
pks = sorted(
model.objects.filter(
matching_algorithm=MatchingModel.MATCH_AUTO,
).values_list("pk", flat=True),
)
for pk in pks:
hasher.update(pk.to_bytes(4, "little", signed=False))
return hasher.digest()
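The key property of this digest is that it depends only on set membership, not on query order or document assignments. A standalone sketch (plain lists of ints in place of the four Django querysets; `auto_label_digest` is a hypothetical stand-in, not part of the codebase) illustrates why the fast skip is safe:

```python
from hashlib import sha256

def auto_label_digest(label_pk_sets):
    # Mirror of _compute_auto_label_set_digest: per label type, hash the
    # sorted pks as 4-byte little-endian integers.
    hasher = sha256()
    for pks in label_pk_sets:
        for pk in sorted(pks):
            hasher.update(pk.to_bytes(4, "little", signed=False))
    return hasher.digest()

# Same membership in a different order -> same digest (skip may fire).
a = auto_label_digest([[3, 1, 2], [7], [], [10]])
b = auto_label_digest([[1, 2, 3], [7], [], [10]])
assert a == b

# Flipping one extra label to MATCH_AUTO changes the set -> digest
# differs, so train() falls through to the full retrain path.
c = auto_label_digest([[1, 2, 3], [7], [5], [10]])
assert a != c
```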
def load(self) -> None:
from sklearn.exceptions import InconsistentVersionWarning
@@ -161,6 +187,7 @@ class DocumentClassifier:
schema_version,
self.last_doc_change_time,
self.last_auto_type_hash,
self.last_auto_label_set_digest,
self.data_vectorizer,
self.tags_binarizer,
self.tags_classifier,
@@ -202,6 +229,7 @@ class DocumentClassifier:
self.FORMAT_VERSION,
self.last_doc_change_time,
self.last_auto_type_hash,
self.last_auto_label_set_digest,
self.data_vectorizer,
self.tags_binarizer,
self.tags_classifier,
@@ -224,6 +252,39 @@ class DocumentClassifier:
) -> bool:
notify = status_callback if status_callback is not None else lambda _: None
# Fast skip: avoid the expensive per-document label scan when nothing
# has changed. Requires a prior training run to have populated both
# last_doc_change_time and last_auto_label_set_digest.
if (
self.last_doc_change_time is not None
and self.last_auto_label_set_digest is not None
):
latest_mod = Document.objects.exclude(
tags__is_inbox_tag=True,
).aggregate(Max("modified"))["modified__max"]
if latest_mod is not None and latest_mod <= self.last_doc_change_time:
current_digest = self._compute_auto_label_set_digest()
if current_digest == self.last_auto_label_set_digest:
logger.info("No updates since last training")
cache.set(
CLASSIFIER_MODIFIED_KEY,
self.last_doc_change_time,
CACHE_50_MINUTES,
)
cache.set(
CLASSIFIER_HASH_KEY,
self.last_auto_type_hash.hex()
if self.last_auto_type_hash
else "",
CACHE_50_MINUTES,
)
cache.set(
CLASSIFIER_VERSION_KEY,
self.FORMAT_VERSION,
CACHE_50_MINUTES,
)
return False
# Get non-inbox documents
docs_queryset = (
Document.objects.exclude(
@@ -242,12 +303,15 @@ class DocumentClassifier:
labels_correspondent = []
labels_document_type = []
labels_storage_path = []
doc_contents: list[str] = []
# Step 1: Extract and preprocess training data from the database.
# Step 1: Extract labels and capture content in a single pass.
logger.debug("Gathering data from database...")
notify(f"Gathering data from {docs_queryset.count()} document(s)...")
hasher = sha256()
for doc in docs_queryset:
doc_contents.append(doc.content)
y = -1
dt = doc.document_type
if dt and dt.matching_algorithm == MatchingModel.MATCH_AUTO:
@@ -282,25 +346,7 @@ class DocumentClassifier:
num_tags = len(labels_tags_unique)
# Check if retraining is actually required.
# A document has been updated since the classifier was trained
# New auto tags, types, correspondent, storage paths exist
latest_doc_change = docs_queryset.latest("modified").modified
if (
self.last_doc_change_time is not None
and self.last_doc_change_time >= latest_doc_change
) and self.last_auto_type_hash == hasher.digest():
logger.info("No updates since last training")
# Set the classifier information into the cache
# Caching for 50 minutes, so slightly less than the normal retrain time
cache.set(
CLASSIFIER_MODIFIED_KEY,
self.last_doc_change_time,
CACHE_50_MINUTES,
)
cache.set(CLASSIFIER_HASH_KEY, hasher.hexdigest(), CACHE_50_MINUTES)
cache.set(CLASSIFIER_VERSION_KEY, self.FORMAT_VERSION, CACHE_50_MINUTES)
return False
# subtract 1 since -1 (null) is also part of the classes.
@@ -317,21 +363,16 @@ class DocumentClassifier:
)
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC
# Step 2: vectorize data
logger.debug("Vectorizing data...")
notify("Vectorizing document content...")
def content_generator() -> Iterator[str]:
"""
Generates the content for documents, one at a time
"""
for doc in docs_queryset:
yield self.preprocess_content(doc.content, shared_cache=False)
self.data_vectorizer = CountVectorizer(
analyzer="word",
ngram_range=(1, 2),
@@ -339,7 +380,8 @@ class DocumentClassifier:
)
data_vectorized: ndarray = self.data_vectorizer.fit_transform(
content_generator(),
self.preprocess_content(content, shared_cache=False)
for content in doc_contents
)
# See the notes here:
@@ -353,8 +395,10 @@ class DocumentClassifier:
notify(f"Training tags classifier ({num_tags} tag(s))...")
if num_tags == 1:
# Special case where only one tag has auto:
# Fallback to binary classification.
# Special case: only one AUTO tag — use binary classification.
# MLPClassifier is used here because LinearSVC requires at least
# 2 distinct classes in training data, which cannot be guaranteed
# when all documents share the single AUTO tag.
labels_tags = [
label[0] if len(label) == 1 else -1 for label in labels_tags
]
@@ -362,11 +406,15 @@ class DocumentClassifier:
labels_tags_vectorized: ndarray = self.tags_binarizer.fit_transform(
labels_tags,
).ravel()
self.tags_classifier = MLPClassifier(tol=0.01)
else:
# General multi-label case: LinearSVC via OneVsRestClassifier.
# Vastly more memory- and time-efficient than MLPClassifier for
# large class counts (e.g. hundreds of AUTO tags).
self.tags_binarizer = MultiLabelBinarizer()
labels_tags_vectorized = self.tags_binarizer.fit_transform(labels_tags)
self.tags_classifier = OneVsRestClassifier(LinearSVC())
self.tags_classifier = MLPClassifier(tol=0.01)
self.tags_classifier.fit(data_vectorized, labels_tags_vectorized)
else:
self.tags_classifier = None
@@ -416,6 +464,7 @@ class DocumentClassifier:
self.last_doc_change_time = latest_doc_change
self.last_auto_type_hash = hasher.digest()
self.last_auto_label_set_digest = self._compute_auto_label_set_digest()
self._update_data_vectorizer_hash()
# Set the classifier information into the cache
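The tags-classifier swap above follows a standard scikit-learn multi-label pattern: binarize the per-document tag lists into an indicator matrix, then fit one binary linear SVM per tag. A toy sketch of the `num_tags > 1` path (toy corpus and `random_state=0` are illustrative choices, not taken from the real pipeline):

```python
# Multi-label training with OneVsRestClassifier(LinearSVC()), mirroring
# the num_tags > 1 branch: MultiLabelBinarizer turns tag-pk lists into a
# 0/1 indicator matrix, and one binary SVM is trained per tag column.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = [
    "invoice payment tax revenue",
    "invoice billing statement",
    "contract clause liability",
    "contract agreement terms",
    "invoice payment contract clause",  # carries both tags
]
labels = [[1], [1], [2], [2], [1, 2]]   # tag pks per document

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)     # shape (n_docs, n_tags)

# random_state pinned here only to make the toy example deterministic.
clf = OneVsRestClassifier(LinearSVC(random_state=0))
clf.fit(X, Y)

pred = clf.predict(vectorizer.transform(["tax revenue billing invoice"]))
predicted_tags = binarizer.inverse_transform(pred)[0]
print(predicted_tags)
```

Each fitted estimator in `clf.estimators_` holds a single weight vector over the vocabulary, which is what keeps memory bounded even with thousands of tag columns.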

View File

@@ -1,4 +1,5 @@
import datetime
import logging
import os
import shutil
import tempfile
@@ -44,16 +45,20 @@ from documents.plugins.helpers import ProgressStatusOptions
from documents.signals import document_consumption_finished
from documents.signals import document_consumption_started
from documents.signals import document_updated
from documents.signals import document_version_added
from documents.signals.handlers import run_workflows
from documents.templating.workflows import parse_w_workflow_placeholders
from documents.utils import compute_checksum
from documents.utils import copy_basic_file_stats
from documents.utils import copy_file_with_basic_stats
from documents.utils import run_subprocess
from paperless.config import OcrConfig
from paperless.models import ArchiveFileGenerationChoices
from paperless.parsers import ParserContext
from paperless.parsers import ParserProtocol
from paperless.parsers.registry import get_parser_registry
from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
from paperless.parsers.utils import extract_pdf_text
from paperless.parsers.utils import is_tagged_pdf
LOGGING_NAME: Final[str] = "paperless.consumer"
@@ -106,6 +111,74 @@ class ConsumerStatusShortMessage(StrEnum):
FAILED = "failed"
def should_produce_archive(
parser: "ParserProtocol",
mime_type: str,
document_path: Path,
log: logging.Logger | None = None,
) -> bool:
"""Return True if a PDF/A archive should be produced for this document.
IMPORTANT: *parser* must be an instantiated parser, not the class.
``requires_pdf_rendition`` and ``can_produce_archive`` are instance
``@property`` methods — accessing them on the class returns the descriptor
(always truthy).
"""
_log = log or logging.getLogger(LOGGING_NAME)
# Must produce a PDF so the frontend can display the original format at all.
if parser.requires_pdf_rendition:
_log.debug("Archive: yes — parser requires PDF rendition for frontend display")
return True
# Parser cannot produce an archive (e.g. TextDocumentParser).
if not parser.can_produce_archive:
_log.debug("Archive: no — parser cannot produce archives")
return False
generation = OcrConfig().archive_file_generation
if generation == ArchiveFileGenerationChoices.ALWAYS:
_log.debug("Archive: yes — ARCHIVE_FILE_GENERATION=always")
return True
if generation == ArchiveFileGenerationChoices.NEVER:
_log.debug("Archive: no — ARCHIVE_FILE_GENERATION=never")
return False
# auto: produce archives for scanned/image documents; skip for born-digital PDFs.
if mime_type.startswith("image/"):
_log.debug("Archive: yes — image document, ARCHIVE_FILE_GENERATION=auto")
return True
if mime_type == "application/pdf":
if is_tagged_pdf(document_path):
_log.debug(
"Archive: no — born-digital PDF (structure tags detected),"
" ARCHIVE_FILE_GENERATION=auto",
)
return False
text = extract_pdf_text(document_path)
if text is None or len(text) <= PDF_TEXT_MIN_LENGTH:
_log.debug(
"Archive: yes — scanned PDF (text_length=%d%d),"
" ARCHIVE_FILE_GENERATION=auto",
len(text) if text else 0,
PDF_TEXT_MIN_LENGTH,
)
return True
_log.debug(
"Archive: no — born-digital PDF (text_length=%d > %d),"
" ARCHIVE_FILE_GENERATION=auto",
len(text),
PDF_TEXT_MIN_LENGTH,
)
return False
_log.debug(
"Archive: no — MIME type %r not eligible for auto archive generation",
mime_type,
)
return False
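The warning in the docstring about passing an instance rather than the class can be reproduced in a few lines (`FakeParser` is a hypothetical stand-in for a parser class): a `@property` accessed on the class returns the property descriptor object itself, which is always truthy, so a class-level check would silently take the `requires_pdf_rendition` branch for every document.

```python
# Why should_produce_archive must receive an *instance*: accessing a
# @property on the class yields the descriptor, not the computed value.
class FakeParser:
    @property
    def requires_pdf_rendition(self) -> bool:
        return False

# On the class: a property object, which is truthy regardless of the
# value the getter would return.
assert isinstance(FakeParser.requires_pdf_rendition, property)
assert bool(FakeParser.requires_pdf_rendition) is True

# On an instance: the getter actually runs and returns the real value.
assert FakeParser().requires_pdf_rendition is False
```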
class ConsumerPluginMixin:
if TYPE_CHECKING:
from logging import Logger
@@ -437,7 +510,17 @@ class ConsumerPlugin(
)
self.log.debug(f"Parsing {self.filename}...")
document_parser.parse(self.working_copy, mime_type)
produce_archive = should_produce_archive(
document_parser,
mime_type,
self.working_copy,
self.log,
)
document_parser.parse(
self.working_copy,
mime_type,
produce_archive=produce_archive,
)
self.log.debug(f"Generating thumbnail for {self.filename}...")
self._send_progress(
@@ -577,13 +660,6 @@ class ConsumerPlugin(
else self.working_copy,
)
if document.root_document_id:
document_version_added.send(
sender=self.__class__,
document=document,
logging_group=self.logging_group,
)
# After everything is in the database, copy the files into
# place. If this fails, we'll also rollback the transaction.
with FileLock(settings.MEDIA_LOCK):
@@ -793,7 +869,7 @@ class ConsumerPlugin(
return document
def apply_overrides(self, document) -> None:
def apply_overrides(self, document: Document) -> None:
if self.metadata.correspondent_id:
document.correspondent = Correspondent.objects.get(
pk=self.metadata.correspondent_id,

View File

@@ -689,7 +689,6 @@ def document_matches_workflow(
trigger_type == WorkflowTrigger.WorkflowTriggerType.DOCUMENT_ADDED
or trigger_type == WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED
or trigger_type == WorkflowTrigger.WorkflowTriggerType.SCHEDULED
or trigger_type == WorkflowTrigger.WorkflowTriggerType.VERSION_ADDED
):
trigger_matched, reason = existing_document_matches_workflow(
document,

View File

@@ -1,28 +0,0 @@
# Generated by Django 5.2.7 on 2026-03-02 00:00
from django.db import migrations
from django.db import models
class Migration(migrations.Migration):
dependencies = [
("documents", "0018_saved_view_simple_search_rules"),
]
operations = [
migrations.AlterField(
model_name="workflowtrigger",
name="type",
field=models.PositiveSmallIntegerField(
choices=[
(1, "Consumption Started"),
(2, "Document Added"),
(3, "Document Updated"),
(4, "Scheduled"),
(5, "Version Added"),
],
default=1,
verbose_name="Workflow Trigger Type",
),
),
]

View File

@@ -1183,7 +1183,6 @@ class WorkflowTrigger(models.Model):
DOCUMENT_ADDED = 2, _("Document Added")
DOCUMENT_UPDATED = 3, _("Document Updated")
SCHEDULED = 4, _("Scheduled")
VERSION_ADDED = 5, _("Version Added")
class DocumentSourceChoices(models.IntegerChoices):
CONSUME_FOLDER = DocumentSource.ConsumeFolder.value, _("Consume Folder")

View File

@@ -3,4 +3,3 @@ from django.dispatch import Signal
document_consumption_started = Signal()
document_consumption_finished = Signal()
document_updated = Signal()
document_version_added = Signal()

View File

@@ -814,19 +814,6 @@ def run_workflows_added(
)
def run_workflows_version_added(
sender,
document: Document,
logging_group: uuid.UUID | None = None,
**kwargs,
) -> None:
run_workflows(
trigger_type=WorkflowTrigger.WorkflowTriggerType.VERSION_ADDED,
document=document.root_document,
logging_group=logging_group,
)
def run_workflows_updated(
sender,
document: Document,

View File

@@ -30,6 +30,7 @@ from documents.consumer import AsnCheckPlugin
from documents.consumer import ConsumerPlugin
from documents.consumer import ConsumerPreflightPlugin
from documents.consumer import WorkflowTriggerPlugin
from documents.consumer import should_produce_archive
from documents.data_models import ConsumableDocument
from documents.data_models import DocumentMetadataOverrides
from documents.double_sided import CollatePlugin
@@ -311,7 +312,16 @@ def update_document_content_maybe_archive_file(document_id) -> None:
parser.configure(ParserContext())
try:
parser.parse(document.source_path, mime_type)
produce_archive = should_produce_archive(
parser,
mime_type,
document.source_path,
)
parser.parse(
document.source_path,
mime_type,
produce_archive=produce_archive,
)
thumbnail = parser.get_thumbnail(document.source_path, mime_type)

View File

@@ -46,7 +46,7 @@ class TestApiAppConfig(DirectoriesMixin, APITestCase):
"pages": None,
"language": None,
"mode": None,
"skip_archive_file": None,
"archive_file_generation": None,
"image_dpi": None,
"unpaper_clean": None,
"deskew": None,

View File

@@ -1020,7 +1020,7 @@ class TestTagBarcode(DirectoriesMixin, SampleDirMixin, GetReaderPluginMixin, Tes
CONSUMER_TAG_BARCODE_SPLIT=True,
CONSUMER_TAG_BARCODE_MAPPING={"TAG:(.*)": "\\g<1>"},
CELERY_TASK_ALWAYS_EAGER=True,
OCR_MODE="skip",
OCR_MODE="auto",
)
def test_consume_barcode_file_tag_split_and_assignment(self) -> None:
"""

View File

@@ -0,0 +1,134 @@
"""
Phase 2 — Single queryset pass in DocumentClassifier.train()
The document queryset must be iterated exactly once: during the label
extraction loop, which now also captures doc.content for vectorization.
The previous content_generator() caused a second full table scan.
"""
from __future__ import annotations
from unittest import mock
import pytest
from django.db.models.query import QuerySet
from documents.classifier import DocumentClassifier
from documents.models import Correspondent
from documents.models import Document
from documents.models import DocumentType
from documents.models import MatchingModel
from documents.models import StoragePath
from documents.models import Tag
# ---------------------------------------------------------------------------
# Fixtures (mirrors test_classifier_train_skip.py)
# ---------------------------------------------------------------------------
@pytest.fixture()
def classifier_settings(settings, tmp_path):
settings.MODEL_FILE = tmp_path / "model.pickle"
return settings
@pytest.fixture()
def classifier(classifier_settings):
return DocumentClassifier()
@pytest.fixture()
def label_corpus(classifier_settings):
c_auto = Correspondent.objects.create(
name="Auto Corp",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
dt_auto = DocumentType.objects.create(
name="Invoice",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
t_auto = Tag.objects.create(
name="finance",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
sp_auto = StoragePath.objects.create(
name="Finance Path",
path="finance/{correspondent}",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
doc_a = Document.objects.create(
title="Invoice A",
content="quarterly invoice payment tax financial statement revenue",
correspondent=c_auto,
document_type=dt_auto,
storage_path=sp_auto,
checksum="aaa",
mime_type="application/pdf",
filename="invoice_a.pdf",
)
doc_a.tags.set([t_auto])
doc_b = Document.objects.create(
title="Invoice B",
content="monthly invoice billing statement account balance due",
correspondent=c_auto,
document_type=dt_auto,
storage_path=sp_auto,
checksum="bbb",
mime_type="application/pdf",
filename="invoice_b.pdf",
)
doc_b.tags.set([t_auto])
doc_c = Document.objects.create(
title="Notes",
content="meeting notes agenda discussion summary action items follow",
checksum="ccc",
mime_type="application/pdf",
filename="notes_c.pdf",
)
return {"doc_a": doc_a, "doc_b": doc_b, "doc_c": doc_c}
# ---------------------------------------------------------------------------
# Tests
# ---------------------------------------------------------------------------
@pytest.mark.django_db()
class TestSingleQuerysetPass:
def test_train_iterates_document_queryset_once(self, classifier, label_corpus):
"""
train() must iterate the Document queryset exactly once.
Before Phase 2 there were two iterations: one in the label extraction
loop and a second inside content_generator() for CountVectorizer.
After Phase 2 content is captured during the label loop; the second
iteration is eliminated.
"""
original_iter = QuerySet.__iter__
doc_iter_count = 0
def counting_iter(qs):
nonlocal doc_iter_count
if qs.model is Document:
doc_iter_count += 1
return original_iter(qs)
with mock.patch.object(QuerySet, "__iter__", counting_iter):
classifier.train()
assert doc_iter_count == 1, (
f"Expected 1 Document queryset iteration, got {doc_iter_count}. "
"content_generator() may still be re-fetching from the DB."
)
def test_train_result_unchanged(self, classifier, label_corpus):
"""
Collapsing to a single pass must not change what the classifier learns:
a second train() with no changes still returns False.
"""
assert classifier.train() is True
assert classifier.train() is False
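The double-scan behaviour this test guards against can be shown without Django at all: each fresh iteration of a lazy queryset is a new table scan, while a generator expression over contents captured during the first loop is not. `CountingQuerySet` below is a hypothetical stand-in for a Django queryset:

```python
# Standalone illustration of the Phase 2 fix: re-iterating a lazy
# queryset inside a generator costs a second full scan; iterating an
# in-memory list captured during the first pass does not.
class CountingQuerySet:
    def __init__(self, docs):
        self.docs = docs
        self.scans = 0

    def __iter__(self):
        self.scans += 1          # each iteration == one full table scan
        return iter(self.docs)

qs = CountingQuerySet(["doc one text", "doc two text"])

# Old shape: label loop, then content_generator() re-iterates -> 2 scans.
labels = [len(d) for d in qs]
list(d.upper() for d in qs)
assert qs.scans == 2

# New shape: capture content during the label loop -> 1 scan.
qs.scans = 0
doc_contents = []
for d in qs:
    doc_contents.append(d)       # captured for later vectorization
list(c.upper() for c in doc_contents)
assert qs.scans == 1
```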

View File

@@ -0,0 +1,300 @@
"""
Tags classifier correctness test — Phase 3b gate.
This test must pass both BEFORE and AFTER the MLPClassifier → LinearSVC swap.
It verifies that the tags classifier correctly learns discriminative signal and
predicts the right tags on held-out documents.
Run before the swap to establish a baseline, then run again after to confirm
the new algorithm is at least as correct.
Two scenarios are tested:
1. Multi-tag (num_tags > 1) — the common case; uses MultiLabelBinarizer
2. Single-tag (num_tags == 1) — special binary path; uses LabelBinarizer
Corpus design: each tag has a distinct vocabulary cluster. Each training
document contains words from exactly one cluster (or two for multi-tag docs).
Held-out test documents contain the same cluster words; correct classification
requires the model to learn the vocabulary → tag mapping.
"""
from __future__ import annotations
import pytest
from documents.classifier import DocumentClassifier
from documents.models import Correspondent
from documents.models import Document
from documents.models import DocumentType
from documents.models import MatchingModel
from documents.models import StoragePath
from documents.models import Tag
# ---------------------------------------------------------------------------
# Vocabulary clusters — intentionally non-overlapping so both MLP and SVM
# should learn them perfectly or near-perfectly.
# ---------------------------------------------------------------------------
FINANCE_WORDS = (
"invoice payment tax revenue billing statement account receivable "
"quarterly budget expense ledger debit credit profit loss fiscal"
)
LEGAL_WORDS = (
"contract agreement terms conditions clause liability indemnity "
"jurisdiction arbitration compliance regulation statute obligation"
)
MEDICAL_WORDS = (
"prescription diagnosis treatment patient health symptom dosage "
"physician referral therapy clinical examination procedure chronic"
)
HR_WORDS = (
"employee salary onboarding performance review appraisal benefits "
"recruitment hiring resignation termination payroll department staff"
)
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture()
def classifier_settings(settings, tmp_path):
settings.MODEL_FILE = tmp_path / "model.pickle"
return settings
@pytest.fixture()
def classifier(classifier_settings):
return DocumentClassifier()
def _make_doc(title, content, checksum, tags=(), **kwargs):
doc = Document.objects.create(
title=title,
content=content,
checksum=checksum,
mime_type="application/pdf",
filename=f"{checksum}.pdf",
**kwargs,
)
if tags:
doc.tags.set(tags)
return doc
def _words(cluster, extra=""):
"""Repeat cluster words enough times to clear min_df=0.01 at ~40 docs."""
return f"{cluster} {cluster} {extra}".strip()
# ---------------------------------------------------------------------------
# Multi-tag correctness
# ---------------------------------------------------------------------------
@pytest.fixture()
def multi_tag_corpus(classifier_settings):
"""
45 training documents across 4 AUTO tags with distinct vocabulary:
10 single-tag docs per tag plus 5 two-tag docs.
A non-AUTO correspondent and doc type are included to keep the
other classifiers happy and not raise ValueError.
"""
t_finance = Tag.objects.create(
name="finance",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
t_legal = Tag.objects.create(
name="legal",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
t_medical = Tag.objects.create(
name="medical",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
t_hr = Tag.objects.create(name="hr", matching_algorithm=MatchingModel.MATCH_AUTO)
# non-AUTO labels to keep the other classifiers from raising
c = Correspondent.objects.create(
name="org",
matching_algorithm=MatchingModel.MATCH_NONE,
)
dt = DocumentType.objects.create(
name="doc",
matching_algorithm=MatchingModel.MATCH_NONE,
)
sp = StoragePath.objects.create(
name="archive",
path="archive",
matching_algorithm=MatchingModel.MATCH_NONE,
)
checksum = 0
def make(title, content, tags):
nonlocal checksum
checksum += 1
return _make_doc(
title,
content,
f"{checksum:04d}",
tags=tags,
correspondent=c,
document_type=dt,
storage_path=sp,
)
# 10 single-tag training docs per tag
for i in range(10):
make(f"finance-{i}", _words(FINANCE_WORDS, f"doc{i}"), [t_finance])
make(f"legal-{i}", _words(LEGAL_WORDS, f"doc{i}"), [t_legal])
make(f"medical-{i}", _words(MEDICAL_WORDS, f"doc{i}"), [t_medical])
make(f"hr-{i}", _words(HR_WORDS, f"doc{i}"), [t_hr])
# 5 two-tag training docs
for i in range(5):
make(
f"finance-legal-{i}",
_words(FINANCE_WORDS + " " + LEGAL_WORDS, f"combo{i}"),
[t_finance, t_legal],
)
return {
"t_finance": t_finance,
"t_legal": t_legal,
"t_medical": t_medical,
"t_hr": t_hr,
}
@pytest.mark.django_db()
class TestMultiTagCorrectness:
"""
The tags classifier must correctly predict tags on held-out documents whose
content clearly belongs to one or two vocabulary clusters.
A prediction is "correct" if the expected tag is present in the result.
"""
def test_single_cluster_docs_predicted_correctly(
self,
classifier,
multi_tag_corpus,
):
"""Each single-cluster held-out doc gets exactly the right tag."""
classifier.train()
tags = multi_tag_corpus
cases = [
(FINANCE_WORDS + " unique alpha", [tags["t_finance"].pk]),
(LEGAL_WORDS + " unique beta", [tags["t_legal"].pk]),
(MEDICAL_WORDS + " unique gamma", [tags["t_medical"].pk]),
(HR_WORDS + " unique delta", [tags["t_hr"].pk]),
]
for content, expected_pks in cases:
predicted = classifier.predict_tags(content)
for pk in expected_pks:
assert pk in predicted, (
f"Expected tag pk={pk} in predictions for content starting "
f"'{content[:40]}', got {predicted}"
)
def test_multi_cluster_doc_gets_both_tags(self, classifier, multi_tag_corpus):
"""A document with finance + legal vocabulary gets both tags."""
classifier.train()
tags = multi_tag_corpus
content = FINANCE_WORDS + " " + LEGAL_WORDS + " unique epsilon"
predicted = classifier.predict_tags(content)
assert tags["t_finance"].pk in predicted, f"Expected finance tag in {predicted}"
assert tags["t_legal"].pk in predicted, f"Expected legal tag in {predicted}"
def test_unrelated_content_predicts_no_trained_tags(
self,
classifier,
multi_tag_corpus,
):
"""
Completely alien content should not confidently fire any learned tag.
This is a soft check — both MLP and SVM may or may not produce false
positives on OOV content, so we only assert that prediction succeeds
and returns a list for a document sharing zero training vocabulary.
"""
classifier.train()
alien = (
"xyzzyx qwerty asdfgh zxcvbn plokij unique zeta "
"xyzzyx qwerty asdfgh zxcvbn plokij unique zeta"
)
predicted = classifier.predict_tags(alien)
# Not a hard requirement — just log for human inspection
# Both MLP and SVM may or may not produce false positives on OOV content
assert isinstance(predicted, list)
# ---------------------------------------------------------------------------
# Single-tag (binary) correctness
# ---------------------------------------------------------------------------
@pytest.fixture()
def single_tag_corpus(classifier_settings):
"""
Corpus with exactly ONE AUTO tag, exercising the LabelBinarizer +
binary classification path. Documents either have the tag or don't.
"""
t_finance = Tag.objects.create(
name="finance",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
c = Correspondent.objects.create(
name="org",
matching_algorithm=MatchingModel.MATCH_NONE,
)
dt = DocumentType.objects.create(
name="doc",
matching_algorithm=MatchingModel.MATCH_NONE,
)
checksum = 0
def make(title, content, tags):
nonlocal checksum
checksum += 1
return _make_doc(
title,
content,
f"s{checksum:04d}",
tags=tags,
correspondent=c,
document_type=dt,
)
for i in range(10):
make(f"finance-{i}", _words(FINANCE_WORDS, f"s{i}"), [t_finance])
make(f"other-{i}", _words(LEGAL_WORDS, f"s{i}"), [])
return {"t_finance": t_finance}
@pytest.mark.django_db()
class TestSingleTagCorrectness:
def test_finance_content_predicts_finance_tag(self, classifier, single_tag_corpus):
"""Finance vocabulary → finance tag predicted."""
classifier.train()
tags = single_tag_corpus
predicted = classifier.predict_tags(FINANCE_WORDS + " unique alpha single")
assert tags["t_finance"].pk in predicted, (
f"Expected finance tag pk={tags['t_finance'].pk} in {predicted}"
)
def test_non_finance_content_predicts_no_tag(self, classifier, single_tag_corpus):
"""Non-finance vocabulary → no tag predicted."""
classifier.train()
predicted = classifier.predict_tags(LEGAL_WORDS + " unique beta single")
assert predicted == [], f"Expected no tags, got {predicted}"

View File

@@ -0,0 +1,325 @@
"""
Phase 1 — fast-skip optimisation in DocumentClassifier.train()
The goal: when nothing has changed since the last training run, train() should
return False after at most 5 DB queries (1x MAX(modified) + 4x MATCH_AUTO pk
lists), not after a full per-document label scan.
Correctness invariant: the skip must NOT fire when the set of AUTO-matching
labels has changed, even if no Document.modified timestamp has advanced (e.g.
a Tag's matching_algorithm was flipped to MATCH_AUTO after the last train).
"""
from __future__ import annotations
from typing import TYPE_CHECKING
import pytest
from django.db import connection
from django.test.utils import CaptureQueriesContext
from documents.classifier import DocumentClassifier
from documents.models import Correspondent
from documents.models import Document
from documents.models import DocumentType
from documents.models import MatchingModel
from documents.models import StoragePath
from documents.models import Tag
if TYPE_CHECKING:
from pathlib import Path
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture()
def classifier_settings(settings, tmp_path: Path):
"""Point MODEL_FILE at a temp directory so tests are hermetic."""
settings.MODEL_FILE = tmp_path / "model.pickle"
return settings
@pytest.fixture()
def classifier(classifier_settings):
"""Fresh DocumentClassifier instance with test settings active."""
return DocumentClassifier()
@pytest.fixture()
def label_corpus(classifier_settings):
"""
Minimal label + document corpus that produces a trainable classifier.
Creates
-------
Correspondents
c_auto — MATCH_AUTO, assigned to two docs
c_none — MATCH_NONE (control)
DocumentTypes
dt_auto — MATCH_AUTO, assigned to two docs
dt_none — MATCH_NONE (control)
Tags
t_auto — MATCH_AUTO, applied to two docs
t_none — MATCH_NONE (control, applied to one doc but never learned)
StoragePaths
sp_auto — MATCH_AUTO, assigned to two docs
sp_none — MATCH_NONE (control)
Documents
doc_a, doc_b — assigned AUTO labels above
doc_c — control doc (MATCH_NONE labels only)
The fixture returns a dict with all created objects for direct mutation in
individual tests.
"""
c_auto = Correspondent.objects.create(
name="Auto Corp",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
c_none = Correspondent.objects.create(
name="Manual Corp",
matching_algorithm=MatchingModel.MATCH_NONE,
)
dt_auto = DocumentType.objects.create(
name="Invoice",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
dt_none = DocumentType.objects.create(
name="Other",
matching_algorithm=MatchingModel.MATCH_NONE,
)
t_auto = Tag.objects.create(
name="finance",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
t_none = Tag.objects.create(
name="misc",
matching_algorithm=MatchingModel.MATCH_NONE,
)
sp_auto = StoragePath.objects.create(
name="Finance Path",
path="finance/{correspondent}",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
sp_none = StoragePath.objects.create(
name="Other Path",
path="other/{correspondent}",
matching_algorithm=MatchingModel.MATCH_NONE,
)
doc_a = Document.objects.create(
title="Invoice from Auto Corp Jan",
content="quarterly invoice payment tax financial statement revenue",
correspondent=c_auto,
document_type=dt_auto,
storage_path=sp_auto,
checksum="aaa",
mime_type="application/pdf",
filename="invoice_a.pdf",
)
doc_a.tags.set([t_auto])
doc_b = Document.objects.create(
title="Invoice from Auto Corp Feb",
content="monthly invoice billing statement account balance due",
correspondent=c_auto,
document_type=dt_auto,
storage_path=sp_auto,
checksum="bbb",
mime_type="application/pdf",
filename="invoice_b.pdf",
)
doc_b.tags.set([t_auto])
# Control document — no AUTO labels, but has enough content to vectorize
doc_c = Document.objects.create(
title="Miscellaneous Notes",
content="meeting notes agenda discussion summary action items follow",
correspondent=c_none,
document_type=dt_none,
checksum="ccc",
mime_type="application/pdf",
filename="notes_c.pdf",
)
doc_c.tags.set([t_none])
return {
"c_auto": c_auto,
"c_none": c_none,
"dt_auto": dt_auto,
"dt_none": dt_none,
"t_auto": t_auto,
"t_none": t_none,
"sp_auto": sp_auto,
"sp_none": sp_none,
"doc_a": doc_a,
"doc_b": doc_b,
"doc_c": doc_c,
}
# ---------------------------------------------------------------------------
# Happy-path skip tests
# ---------------------------------------------------------------------------
@pytest.mark.django_db()
class TestFastSkipFires:
"""The no-op path: nothing changed, so the second train() is skipped."""
def test_first_train_returns_true(self, classifier, label_corpus):
"""First train on a fresh classifier must return True (did work)."""
assert classifier.train() is True
def test_second_train_returns_false(self, classifier, label_corpus):
"""Second train with no changes must return False (skipped)."""
classifier.train()
assert classifier.train() is False
def test_fast_skip_runs_minimal_queries(self, classifier, label_corpus):
"""
The no-op path must use at most 5 DB queries:
1x Document.objects.aggregate(Max('modified'))
4x MATCH_AUTO pk lists (Correspondent / DocumentType / Tag / StoragePath)
Before Phase 1, the implementation iterated every document to build
the label hash BEFORE it could decide to skip, which is O(N) in documents.
This test verifies the fast path is in place.
"""
classifier.train()
with CaptureQueriesContext(connection) as ctx:
result = classifier.train()
assert result is False
assert len(ctx.captured_queries) <= 5, (
f"Fast skip used {len(ctx.captured_queries)} queries; expected ≤5.\n"
+ "\n".join(q["sql"] for q in ctx.captured_queries)
)
def test_fast_skip_refreshes_cache_keys(self, classifier, label_corpus):
"""
Even on a skip, the cache keys must be refreshed so that the task
scheduler can detect the classifier is still current.
"""
from django.core.cache import cache
from documents.caching import CLASSIFIER_HASH_KEY
from documents.caching import CLASSIFIER_MODIFIED_KEY
from documents.caching import CLASSIFIER_VERSION_KEY
classifier.train()
# Evict the keys to prove skip re-populates them
cache.delete(CLASSIFIER_MODIFIED_KEY)
cache.delete(CLASSIFIER_HASH_KEY)
cache.delete(CLASSIFIER_VERSION_KEY)
result = classifier.train()
assert result is False
assert cache.get(CLASSIFIER_MODIFIED_KEY) is not None
assert cache.get(CLASSIFIER_HASH_KEY) is not None
assert cache.get(CLASSIFIER_VERSION_KEY) is not None
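The ≤5-query budget these tests enforce implies a gate shaped roughly like the sketch below. This is a hypothetical illustration, not the actual DocumentClassifier.train() code; the function name, inputs, and hashing scheme are assumptions — the real digest only needs one `aggregate(Max("modified"))` query plus the four MATCH_AUTO pk-list queries.

```python
import hashlib


def fast_skip_digest(latest_modified, auto_pk_lists):
    """Digest of everything the classifier learns from: the newest
    Document.modified timestamp plus the pk sets of all MATCH_AUTO
    labels (correspondents, document types, tags, storage paths).

    latest_modified: e.g. Document.objects.aggregate(Max("modified"))
    auto_pk_lists: four iterables of primary keys, one per label model
    """
    h = hashlib.sha256()
    h.update(str(latest_modified).encode())
    for pks in auto_pk_lists:
        # Sort so the digest is independent of queryset ordering
        h.update(",".join(str(pk) for pk in sorted(pks)).encode())
        h.update(b"|")
    return h.hexdigest()
```

train() would compare this digest against the one stored from the previous run and return False on a match — a handful of cheap queries instead of a full corpus scan, while still catching label-set changes (a new or deleted MATCH_AUTO object) that never touch Document.modified.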
# ---------------------------------------------------------------------------
# Correctness tests — skip must NOT fire when the world has changed
# ---------------------------------------------------------------------------
@pytest.mark.django_db()
class TestFastSkipDoesNotFire:
"""The skip guard must yield to a full retrain whenever labels change."""
def test_document_content_modification_triggers_retrain(
self,
classifier,
label_corpus,
):
"""Updating a document's content updates modified → retrain required."""
classifier.train()
doc_a = label_corpus["doc_a"]
doc_a.content = "completely different words here now nothing same"
doc_a.save()
assert classifier.train() is True
def test_document_label_reassignment_triggers_retrain(
self,
classifier,
label_corpus,
):
"""
Reassigning a document to a different AUTO correspondent (touching
doc.modified) must trigger a retrain.
"""
c_auto2 = Correspondent.objects.create(
name="Second Auto Corp",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
classifier.train()
doc_a = label_corpus["doc_a"]
doc_a.correspondent = c_auto2
doc_a.save()
assert classifier.train() is True
def test_matching_algorithm_change_on_assigned_tag_triggers_retrain(
self,
classifier,
label_corpus,
):
"""
Flipping a tag's matching_algorithm to MATCH_AUTO after it is already
assigned to documents must trigger a retrain — even though no
Document.modified timestamp advances.
This is the key correctness case for the auto-label-set digest:
the tag is already on doc_a and doc_b, so once it becomes MATCH_AUTO
the classifier needs to learn it.
"""
# t_none is applied to doc_c (a control doc) via the fixture.
# We flip it to MATCH_AUTO; the set of learnable AUTO tags grows.
classifier.train()
t_none = label_corpus["t_none"]
t_none.matching_algorithm = MatchingModel.MATCH_AUTO
t_none.save(update_fields=["matching_algorithm"])
# Document.modified is NOT touched — this test specifically verifies
# that the auto-label-set digest catches the change.
assert classifier.train() is True
def test_new_auto_correspondent_triggers_retrain(self, classifier, label_corpus):
"""
Adding a brand-new MATCH_AUTO correspondent (unassigned to any doc)
must trigger a retrain: the auto-label-set has grown.
"""
classifier.train()
Correspondent.objects.create(
name="New Auto Corp",
matching_algorithm=MatchingModel.MATCH_AUTO,
)
assert classifier.train() is True
def test_removing_auto_label_triggers_retrain(self, classifier, label_corpus):
"""
Deleting a MATCH_AUTO correspondent shrinks the auto-label-set and
must trigger a retrain.
"""
classifier.train()
label_corpus["c_auto"].delete()
assert classifier.train() is True
def test_fresh_classifier_always_trains(self, classifier, label_corpus):
"""
A classifier that has never been trained (last_doc_change_time is None)
must always perform a full train, regardless of corpus state.
"""
assert classifier.last_doc_change_time is None
assert classifier.train() is True
def test_no_documents_raises_value_error(self, classifier, classifier_settings):
"""train() with an empty database must raise ValueError."""
with pytest.raises(ValueError, match="No training data"):
classifier.train()


@@ -230,7 +230,11 @@ class TestConsumer(
shutil.copy(src, dst)
return dst
@override_settings(FILENAME_FORMAT=None, TIME_ZONE="America/Chicago")
@override_settings(
FILENAME_FORMAT=None,
TIME_ZONE="America/Chicago",
ARCHIVE_FILE_GENERATION="always",
)
def testNormalOperation(self) -> None:
filename = self.get_test_file()
@@ -629,7 +633,10 @@ class TestConsumer(
# Database empty
self.assertEqual(Document.objects.all().count(), 0)
@override_settings(FILENAME_FORMAT="{correspondent}/{title}")
@override_settings(
FILENAME_FORMAT="{correspondent}/{title}",
ARCHIVE_FILE_GENERATION="always",
)
def testFilenameHandling(self) -> None:
with self.get_consumer(
self.get_test_file(),
@@ -646,7 +653,7 @@ class TestConsumer(
self._assert_first_last_send_progress()
@mock.patch("documents.consumer.generate_unique_filename")
@override_settings(FILENAME_FORMAT="{pk}")
@override_settings(FILENAME_FORMAT="{pk}", ARCHIVE_FILE_GENERATION="always")
def testFilenameHandlingFallsBackWhenGeneratedPathExceedsDbLimit(self, m):
m.side_effect = lambda doc, archive_filename=False: Path(
("a" * 1100 + ".pdf") if not archive_filename else ("b" * 1100 + ".pdf"),
@@ -673,7 +680,10 @@ class TestConsumer(
self._assert_first_last_send_progress()
@override_settings(FILENAME_FORMAT="{correspondent}/{title}")
@override_settings(
FILENAME_FORMAT="{correspondent}/{title}",
ARCHIVE_FILE_GENERATION="always",
)
@mock.patch("documents.signals.handlers.generate_unique_filename")
def testFilenameHandlingUnstableFormat(self, m) -> None:
filenames = ["this", "that", "now this", "i cannot decide"]
@@ -720,16 +730,9 @@ class TestConsumer(
self._assert_first_last_send_progress()
@override_settings(AUDIT_LOG_ENABLED=True)
@mock.patch("documents.consumer.document_updated.send")
@mock.patch("documents.consumer.document_version_added.send")
@mock.patch("documents.consumer.load_classifier")
def test_consume_version_creates_new_version(
self,
mock_load_classifier: mock.Mock,
mock_document_version_added_send: mock.Mock,
mock_document_updated_send: mock.Mock,
) -> None:
mock_load_classifier.return_value = MagicMock()
def test_consume_version_creates_new_version(self, m) -> None:
m.return_value = MagicMock()
with self.get_consumer(self.get_test_file()) as consumer:
consumer.run()
@@ -797,16 +800,6 @@ class TestConsumer(
self.assertIsNone(version.archive_serial_number)
self.assertEqual(version.original_filename, version_file.name)
self.assertTrue(bool(version.content))
mock_document_version_added_send.assert_called_once()
self.assertEqual(
mock_document_version_added_send.call_args.kwargs["document"].id,
version.id,
)
mock_document_updated_send.assert_called_once()
self.assertEqual(
mock_document_updated_send.call_args.kwargs["document"].id,
root_doc.id,
)
@override_settings(AUDIT_LOG_ENABLED=True)
@mock.patch("documents.consumer.load_classifier")
@@ -1038,7 +1031,7 @@ class TestConsumer(
self.assertEqual(Document.objects.count(), 2)
self._assert_first_last_send_progress()
@override_settings(FILENAME_FORMAT="{title}")
@override_settings(FILENAME_FORMAT="{title}", ARCHIVE_FILE_GENERATION="always")
@mock.patch("documents.consumer.get_parser_registry")
def test_similar_filenames(self, m) -> None:
shutil.copy(
@@ -1149,6 +1142,7 @@ class TestConsumer(
mock_mail_parser_parse.assert_called_once_with(
consumer.working_copy,
"message/rfc822",
produce_archive=True,
)
@@ -1296,7 +1290,14 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
def test_no_pre_consume_script(self, m) -> None:
with self.get_consumer(self.test_file) as c:
c.run()
m.assert_not_called()
# Verify no pre-consume script subprocess was invoked
# (run_subprocess may still be called by _extract_text_for_archive_check)
script_calls = [
call
for call in m.call_args_list
if call.args and call.args[0] and call.args[0][0] not in ("pdftotext",)
]
self.assertEqual(script_calls, [])
@mock.patch("documents.consumer.run_subprocess")
@override_settings(PRE_CONSUME_SCRIPT="does-not-exist")
@@ -1312,9 +1313,16 @@ class PreConsumeTestCase(DirectoriesMixin, GetConsumerMixin, TestCase):
with self.get_consumer(self.test_file) as c:
c.run()
m.assert_called_once()
self.assertTrue(m.called)
args, _ = m.call_args
# Find the call that invoked the pre-consume script
# (run_subprocess may also be called by _extract_text_for_archive_check)
script_call = next(
call
for call in m.call_args_list
if call.args and call.args[0] and call.args[0][0] == script.name
)
args, _ = script_call
command = args[0]
environment = args[1]


@@ -0,0 +1,189 @@
"""Tests for should_produce_archive()."""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
from unittest.mock import MagicMock
import pytest
from documents.consumer import should_produce_archive
if TYPE_CHECKING:
from pytest_mock import MockerFixture
def _parser_instance(
*,
can_produce: bool = True,
requires_rendition: bool = False,
) -> MagicMock:
"""Return a mock parser instance with the given capability flags."""
instance = MagicMock()
instance.can_produce_archive = can_produce
instance.requires_pdf_rendition = requires_rendition
return instance
@pytest.fixture()
def null_app_config(mocker) -> MagicMock:
"""Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
return mocker.MagicMock(
output_type=None,
pages=None,
language=None,
mode=None,
archive_file_generation=None,
image_dpi=None,
unpaper_clean=None,
deskew=None,
rotate_pages=None,
rotate_pages_threshold=None,
max_image_pixels=None,
color_conversion_strategy=None,
user_args=None,
)
@pytest.fixture(autouse=True)
def patch_app_config(mocker, null_app_config):
"""Patch BaseConfig._get_config_instance for all tests in this module."""
mocker.patch(
"paperless.config.BaseConfig._get_config_instance",
return_value=null_app_config,
)
class TestShouldProduceArchive:
@pytest.mark.parametrize(
("generation", "can_produce", "requires_rendition", "mime", "expected"),
[
pytest.param(
"never",
True,
False,
"application/pdf",
False,
id="never-returns-false",
),
pytest.param(
"always",
True,
False,
"application/pdf",
True,
id="always-returns-true",
),
pytest.param(
"never",
True,
True,
"application/pdf",
True,
id="requires-rendition-overrides-never",
),
pytest.param(
"always",
False,
False,
"text/plain",
False,
id="cannot-produce-overrides-always",
),
pytest.param(
"always",
False,
True,
"application/pdf",
True,
id="requires-rendition-wins-even-if-cannot-produce",
),
pytest.param(
"auto",
True,
False,
"image/tiff",
True,
id="auto-image-returns-true",
),
pytest.param(
"auto",
True,
False,
"message/rfc822",
False,
id="auto-non-pdf-non-image-returns-false",
),
],
)
def test_generation_setting(
self,
settings,
generation: str,
can_produce: bool, # noqa: FBT001
requires_rendition: bool, # noqa: FBT001
mime: str,
expected: bool, # noqa: FBT001
) -> None:
settings.ARCHIVE_FILE_GENERATION = generation
parser = _parser_instance(
can_produce=can_produce,
requires_rendition=requires_rendition,
)
assert should_produce_archive(parser, mime, Path("/tmp/doc")) is expected
@pytest.mark.parametrize(
("extracted_text", "expected"),
[
pytest.param(
"This is a born-digital PDF with lots of text content. " * 10,
False,
id="born-digital-long-text-skips-archive",
),
pytest.param(None, True, id="no-text-scanned-produces-archive"),
pytest.param("tiny", True, id="short-text-treated-as-scanned"),
],
)
def test_auto_pdf_archive_decision(
self,
mocker: MockerFixture,
settings,
extracted_text: str | None,
expected: bool, # noqa: FBT001
) -> None:
settings.ARCHIVE_FILE_GENERATION = "auto"
mocker.patch("documents.consumer.is_tagged_pdf", return_value=False)
mocker.patch("documents.consumer.extract_pdf_text", return_value=extracted_text)
parser = _parser_instance(can_produce=True, requires_rendition=False)
assert (
should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
is expected
)
def test_tagged_pdf_skips_archive_in_auto_mode(
self,
mocker: MockerFixture,
settings,
) -> None:
"""Tagged PDFs (e.g. Word exports) are treated as born-digital regardless of text length."""
settings.ARCHIVE_FILE_GENERATION = "auto"
mocker.patch("documents.consumer.is_tagged_pdf", return_value=True)
parser = _parser_instance(can_produce=True, requires_rendition=False)
assert (
should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
is False
)
def test_tagged_pdf_does_not_call_pdftotext(
self,
mocker: MockerFixture,
settings,
) -> None:
"""When a PDF is tagged, pdftotext is not invoked (fast path)."""
settings.ARCHIVE_FILE_GENERATION = "auto"
mocker.patch("documents.consumer.is_tagged_pdf", return_value=True)
mock_extract = mocker.patch("documents.consumer.extract_pdf_text")
parser = _parser_instance(can_produce=True, requires_rendition=False)
should_produce_archive(parser, "application/pdf", Path("/tmp/doc.pdf"))
mock_extract.assert_not_called()
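The parametrize table and the tagged-PDF tests together pin down a precedence order: a required PDF rendition wins over everything, then parser capability, then the generation setting, with "auto" doing content inspection only for PDFs. A minimal sketch of that decision logic — not the real `should_produce_archive`, which reads Django settings and calls `is_tagged_pdf`/`extract_pdf_text` itself; here those are injected parameters, and the text-length threshold is an assumed constant:

```python
from pathlib import Path

# Assumed threshold; the tests treat "tiny" as scanned, long text as born-digital.
PDF_TEXT_MIN_LENGTH = 50


def should_produce_archive_sketch(
    parser,
    mime_type: str,
    path: Path,
    *,
    generation: str,
    is_tagged=lambda p: False,
    extract_text=lambda p: None,
) -> bool:
    # A rendition requirement always wins, even over can_produce_archive=False.
    if parser.requires_pdf_rendition:
        return True
    # A parser that cannot produce an archive overrides even "always".
    if not parser.can_produce_archive:
        return False
    if generation == "never":
        return False
    if generation == "always":
        return True
    # generation == "auto": images always get an archive copy.
    if mime_type.startswith("image/"):
        return True
    if mime_type != "application/pdf":
        return False
    # Tagged PDFs (Word exports etc.) are born-digital: skip, no pdftotext call.
    if is_tagged(path):
        return False
    text = extract_text(path)
    # No text, or very little -> probably scanned -> OCR into an archive file.
    return text is None or len(text) < PDF_TEXT_MIN_LENGTH
```

Each branch corresponds to one `pytest.param` id above, which makes the table a near-exhaustive spec of the function.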


@@ -27,7 +27,10 @@ sample_file: Path = Path(__file__).parent / "samples" / "simple.pdf"
@pytest.mark.management
@override_settings(FILENAME_FORMAT="{correspondent}/{title}")
@override_settings(
FILENAME_FORMAT="{correspondent}/{title}",
ARCHIVE_FILE_GENERATION="always",
)
class TestArchiver(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
def make_models(self):
return Document.objects.create(


@@ -213,6 +213,7 @@ class TestEmptyTrashTask(DirectoriesMixin, FileSystemAssertsMixin, TestCase):
self.assertEqual(Document.global_objects.count(), 0)
@override_settings(ARCHIVE_FILE_GENERATION="always")
class TestUpdateContent(DirectoriesMixin, TestCase):
def test_update_content_maybe_archive_file(self) -> None:
"""


@@ -61,7 +61,6 @@ from documents.models import WorkflowTrigger
from documents.plugins.base import StopConsumeTaskError
from documents.serialisers import WorkflowTriggerSerializer
from documents.signals import document_consumption_finished
from documents.signals import document_version_added
from documents.tests.utils import DirectoriesMixin
from documents.tests.utils import DummyProgressManager
from documents.tests.utils import FileSystemAssertsMixin
@@ -1903,53 +1902,6 @@ class TestWorkflows(
).exists(),
)
def test_version_added_workflow_runs_on_root_document(self) -> None:
trigger = WorkflowTrigger.objects.create(
type=WorkflowTrigger.WorkflowTriggerType.VERSION_ADDED,
)
action = WorkflowAction.objects.create(
assign_title="Updated by version",
assign_owner=self.user2,
)
workflow = Workflow.objects.create(
name="Version workflow",
order=0,
)
workflow.triggers.add(trigger)
workflow.actions.add(action)
root_doc = Document.objects.create(
title="root",
correspondent=self.c,
original_filename="root.pdf",
)
version_doc = Document.objects.create(
title="version",
correspondent=self.c,
original_filename="version.pdf",
root_document=root_doc,
)
document_version_added.send(
sender=self.__class__,
document=version_doc,
)
root_doc.refresh_from_db()
version_doc.refresh_from_db()
self.assertEqual(root_doc.title, "Updated by version")
self.assertEqual(root_doc.owner, self.user2)
self.assertIsNone(version_doc.owner)
self.assertEqual(
WorkflowRun.objects.filter(
workflow=workflow,
type=WorkflowTrigger.WorkflowTriggerType.VERSION_ADDED,
document=root_doc,
).count(),
1,
)
def test_document_updated_workflow(self) -> None:
trigger = WorkflowTrigger.objects.create(
type=WorkflowTrigger.WorkflowTriggerType.DOCUMENT_UPDATED,


@@ -2,7 +2,7 @@ msgid ""
msgstr ""
"Project-Id-Version: paperless-ngx\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2026-04-03 20:54+0000\n"
"POT-Creation-Date: 2026-04-06 22:51+0000\n"
"PO-Revision-Date: 2022-02-17 04:17\n"
"Last-Translator: \n"
"Language-Team: English\n"
@@ -1666,32 +1666,28 @@ msgstr ""
msgid "pdfa-3"
msgstr ""
#: paperless/models.py:39
msgid "skip"
#: paperless/models.py:39 paperless/models.py:50
msgid "auto"
msgstr ""
#: paperless/models.py:40
msgid "redo"
msgstr ""
#: paperless/models.py:41
msgid "force"
msgstr ""
#: paperless/models.py:42
msgid "skip_noarchive"
#: paperless/models.py:41
msgid "redo"
msgstr ""
#: paperless/models.py:50
msgid "never"
#: paperless/models.py:42
msgid "off"
msgstr ""
#: paperless/models.py:51
msgid "with_text"
msgid "always"
msgstr ""
#: paperless/models.py:52
msgid "always"
msgid "never"
msgstr ""
#: paperless/models.py:60
@@ -1755,7 +1751,7 @@ msgid "Sets the OCR mode"
msgstr ""
#: paperless/models.py:130
msgid "Controls the generation of an archive file"
msgid "Controls archive file generation"
msgstr ""
#: paperless/models.py:138


@@ -5,6 +5,7 @@ import shutil
import stat
import subprocess
from pathlib import Path
from typing import Any
from django.conf import settings
from django.core.checks import Error
@@ -22,7 +23,7 @@ writeable_hint = (
)
def path_check(var, directory: Path) -> list[Error]:
def path_check(var: str, directory: Path) -> list[Error]:
messages: list[Error] = []
if directory:
if not directory.is_dir():
@@ -59,7 +60,7 @@ def path_check(var, directory: Path) -> list[Error]:
@register()
def paths_check(app_configs, **kwargs) -> list[Error]:
def paths_check(app_configs: Any, **kwargs: Any) -> list[Error]:
"""
Check the various paths for existence, readability and writeability
"""
@@ -73,7 +74,7 @@ def paths_check(app_configs, **kwargs) -> list[Error]:
@register()
def binaries_check(app_configs, **kwargs):
def binaries_check(app_configs: Any, **kwargs: Any) -> list[Error]:
"""
Paperless requires the existence of a few binaries, so we do some checks
for those here.
@@ -93,7 +94,7 @@ def binaries_check(app_configs, **kwargs):
@register()
def debug_mode_check(app_configs, **kwargs):
def debug_mode_check(app_configs: Any, **kwargs: Any) -> list[Warning]:
if settings.DEBUG:
return [
Warning(
@@ -109,7 +110,7 @@ def debug_mode_check(app_configs, **kwargs):
@register()
def settings_values_check(app_configs, **kwargs):
def settings_values_check(app_configs: Any, **kwargs: Any) -> list[Error | Warning]:
"""
Validates at least some of the user provided settings
"""
@@ -132,23 +133,14 @@ def settings_values_check(app_configs, **kwargs):
Error(f'OCR output type "{settings.OCR_OUTPUT_TYPE}" is not valid'),
)
if settings.OCR_MODE not in {"force", "skip", "redo", "skip_noarchive"}:
if settings.OCR_MODE not in {"auto", "force", "redo", "off"}:
msgs.append(Error(f'OCR output mode "{settings.OCR_MODE}" is not valid'))
if settings.OCR_MODE == "skip_noarchive":
msgs.append(
Warning(
'OCR output mode "skip_noarchive" is deprecated and will be '
"removed in a future version. Please use "
"PAPERLESS_OCR_SKIP_ARCHIVE_FILE instead.",
),
)
if settings.OCR_SKIP_ARCHIVE_FILE not in {"never", "with_text", "always"}:
if settings.ARCHIVE_FILE_GENERATION not in {"auto", "always", "never"}:
msgs.append(
Error(
"OCR_SKIP_ARCHIVE_FILE setting "
f'"{settings.OCR_SKIP_ARCHIVE_FILE}" is not valid',
"PAPERLESS_ARCHIVE_FILE_GENERATION setting "
f'"{settings.ARCHIVE_FILE_GENERATION}" is not valid',
),
)
@@ -191,7 +183,7 @@ def settings_values_check(app_configs, **kwargs):
@register()
def audit_log_check(app_configs, **kwargs):
def audit_log_check(app_configs: Any, **kwargs: Any) -> list[Error]:
db_conn = connections["default"]
all_tables = db_conn.introspection.table_names()
result = []
@@ -303,7 +295,42 @@ def check_deprecated_db_settings(
@register()
def check_remote_parser_configured(app_configs, **kwargs) -> list[Error]:
def check_deprecated_v2_ocr_env_vars(
app_configs: object,
**kwargs: object,
) -> list[Warning]:
"""Warn when deprecated v2 OCR environment variables are set.
Users upgrading from v2 may still have these in their environment or
config files, where they are now silently ignored.
"""
warnings: list[Warning] = []
if os.environ.get("PAPERLESS_OCR_SKIP_ARCHIVE_FILE"):
warnings.append(
Warning(
"PAPERLESS_OCR_SKIP_ARCHIVE_FILE is set but has no effect. "
"Use PAPERLESS_ARCHIVE_FILE_GENERATION=never/always/auto instead.",
id="paperless.W002",
),
)
ocr_mode = os.environ.get("PAPERLESS_OCR_MODE", "")
if ocr_mode in {"skip", "skip_noarchive"}:
warnings.append(
Warning(
f"PAPERLESS_OCR_MODE={ocr_mode!r} is not a valid value. "
f"Use PAPERLESS_OCR_MODE=auto (and PAPERLESS_ARCHIVE_FILE_GENERATION=never "
f"if you used skip_noarchive) instead.",
id="paperless.W003",
),
)
return warnings
@register()
def check_remote_parser_configured(app_configs: Any, **kwargs: Any) -> list[Error]:
if settings.REMOTE_OCR_ENGINE == "azureai" and not (
settings.REMOTE_OCR_ENDPOINT and settings.REMOTE_OCR_API_KEY
):
@@ -329,7 +356,7 @@ def get_tesseract_langs():
@register()
def check_default_language_available(app_configs, **kwargs):
def check_default_language_available(app_configs: Any, **kwargs: Any) -> list[Error]:
errs = []
if not settings.OCR_LANGUAGE:


@@ -4,6 +4,11 @@ import json
from django.conf import settings
from paperless.models import ApplicationConfiguration
from paperless.models import ArchiveFileGenerationChoices
from paperless.models import CleanChoices
from paperless.models import ColorConvertChoices
from paperless.models import ModeChoices
from paperless.models import OutputTypeChoices
@dataclasses.dataclass
@@ -28,7 +33,7 @@ class OutputTypeConfig(BaseConfig):
Almost all parsers care about the chosen PDF output format
"""
output_type: str = dataclasses.field(init=False)
output_type: OutputTypeChoices = dataclasses.field(init=False)
def __post_init__(self) -> None:
app_config = self._get_config_instance()
@@ -45,15 +50,17 @@ class OcrConfig(OutputTypeConfig):
pages: int | None = dataclasses.field(init=False)
language: str = dataclasses.field(init=False)
mode: str = dataclasses.field(init=False)
skip_archive_file: str = dataclasses.field(init=False)
mode: ModeChoices = dataclasses.field(init=False)
archive_file_generation: ArchiveFileGenerationChoices = dataclasses.field(
init=False,
)
image_dpi: int | None = dataclasses.field(init=False)
clean: str = dataclasses.field(init=False)
clean: CleanChoices = dataclasses.field(init=False)
deskew: bool = dataclasses.field(init=False)
rotate: bool = dataclasses.field(init=False)
rotate_threshold: float = dataclasses.field(init=False)
max_image_pixel: float | None = dataclasses.field(init=False)
color_conversion_strategy: str = dataclasses.field(init=False)
color_conversion_strategy: ColorConvertChoices = dataclasses.field(init=False)
user_args: dict[str, str] | None = dataclasses.field(init=False)
def __post_init__(self) -> None:
@@ -64,8 +71,8 @@ class OcrConfig(OutputTypeConfig):
self.pages = app_config.pages or settings.OCR_PAGES
self.language = app_config.language or settings.OCR_LANGUAGE
self.mode = app_config.mode or settings.OCR_MODE
self.skip_archive_file = (
app_config.skip_archive_file or settings.OCR_SKIP_ARCHIVE_FILE
self.archive_file_generation = (
app_config.archive_file_generation or settings.ARCHIVE_FILE_GENERATION
)
self.image_dpi = app_config.image_dpi or settings.OCR_IMAGE_DPI
self.clean = app_config.unpaper_clean or settings.OCR_CLEAN


@@ -0,0 +1,90 @@
# Generated by Django 5.2.12 on 2026-03-26 20:31
from django.db import migrations
from django.db import models
_MODE_MAP = {
"skip": "auto",
"redo": "redo",
"force": "force",
"skip_noarchive": "auto",
}
_ARCHIVE_MAP = {
# never skip -> always generate
"never": "always",
# skip when text present -> auto
"with_text": "auto",
# always skip -> never generate
"always": "never",
}
def migrate_old_values(apps, schema_editor):
ApplicationConfiguration = apps.get_model("paperless", "ApplicationConfiguration")
for config in ApplicationConfiguration.objects.all():
old_mode = config.mode
old_skip = config.skip_archive_file
# Map the old mode value
if old_mode in _MODE_MAP:
config.mode = _MODE_MAP[old_mode]
# Map skip_archive_file -> archive_file_generation
if old_skip in _ARCHIVE_MAP:
config.archive_file_generation = _ARCHIVE_MAP[old_skip]
# skip_noarchive implied no archive file; set that if the user
# didn't already have an explicit skip_archive_file preference
if old_mode == "skip_noarchive" and old_skip is None:
config.archive_file_generation = "never"
config.save()
class Migration(migrations.Migration):
dependencies = [
("paperless", "0007_optimize_integer_field_sizes"),
]
operations = [
# 1. Update mode choices in-place (old values still in the column)
migrations.AlterField(
model_name="applicationconfiguration",
name="mode",
field=models.CharField(
blank=True,
choices=[
("auto", "auto"),
("force", "force"),
("redo", "redo"),
("off", "off"),
],
max_length=16,
null=True,
verbose_name="Sets the OCR mode",
),
),
# 2. Add the new field
migrations.AddField(
model_name="applicationconfiguration",
name="archive_file_generation",
field=models.CharField(
blank=True,
choices=[("auto", "auto"), ("always", "always"), ("never", "never")],
max_length=8,
null=True,
verbose_name="Controls archive file generation",
),
),
# 3. Migrate data from old values to new
migrations.RunPython(
migrate_old_values,
migrations.RunPython.noop,
),
# 4. Drop the old field
migrations.RemoveField(
model_name="applicationconfiguration",
name="skip_archive_file",
),
]
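Restated as a pure function, the migration's value mapping looks like the sketch below — a standalone illustration for checking the edge cases; the real migration operates on ApplicationConfiguration rows, and `migrate_config_values` is a hypothetical name:

```python
_MODE_MAP = {
    "skip": "auto",
    "redo": "redo",
    "force": "force",
    "skip_noarchive": "auto",
}
_ARCHIVE_MAP = {
    "never": "always",      # never skip the archive -> always generate it
    "with_text": "auto",    # skip when text present -> auto
    "always": "never",      # always skip -> never generate
}


def migrate_config_values(mode, skip_archive_file):
    """Return (new_mode, new_archive_generation) for one config row."""
    new_mode = _MODE_MAP.get(mode, mode)
    new_archive = _ARCHIVE_MAP.get(skip_archive_file)
    # skip_noarchive implied "no archive file"; honour that only when the
    # user had no explicit skip_archive_file preference of their own.
    if mode == "skip_noarchive" and skip_archive_file is None:
        new_archive = "never"
    return new_mode, new_archive
```

Note the inversion in `_ARCHIVE_MAP`: the old setting named the condition for *skipping* the archive file, while the new one names the condition for *generating* it, so "never" and "always" swap meanings.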


@@ -36,20 +36,20 @@ class ModeChoices(models.TextChoices):
and our own custom setting
"""
SKIP = ("skip", _("skip"))
REDO = ("redo", _("redo"))
AUTO = ("auto", _("auto"))
FORCE = ("force", _("force"))
SKIP_NO_ARCHIVE = ("skip_noarchive", _("skip_noarchive"))
REDO = ("redo", _("redo"))
OFF = ("off", _("off"))
class ArchiveFileChoices(models.TextChoices):
class ArchiveFileGenerationChoices(models.TextChoices):
"""
Settings to control creation of an archive PDF file
"""
NEVER = ("never", _("never"))
WITH_TEXT = ("with_text", _("with_text"))
AUTO = ("auto", _("auto"))
ALWAYS = ("always", _("always"))
NEVER = ("never", _("never"))
class CleanChoices(models.TextChoices):
@@ -126,12 +126,12 @@ class ApplicationConfiguration(AbstractSingletonModel):
choices=ModeChoices.choices,
)
skip_archive_file = models.CharField(
verbose_name=_("Controls the generation of an archive file"),
archive_file_generation = models.CharField(
verbose_name=_("Controls archive file generation"),
null=True,
blank=True,
max_length=16,
choices=ArchiveFileChoices.choices,
max_length=8,
choices=ArchiveFileGenerationChoices.choices,
)
image_dpi = models.PositiveSmallIntegerField(


@@ -1,5 +1,6 @@
from __future__ import annotations
import importlib.resources
import logging
import os
import re
@@ -8,6 +9,8 @@ import tempfile
from pathlib import Path
from typing import TYPE_CHECKING
from typing import Any
from typing import Final
from typing import NoReturn
from typing import Self
from django.conf import settings
@@ -15,12 +18,16 @@ from PIL import Image
from documents.parsers import ParseError
from documents.parsers import make_thumbnail_from_pdf
from documents.utils import copy_file_with_basic_stats
from documents.utils import maybe_override_pixel_limit
from documents.utils import run_subprocess
from paperless.config import OcrConfig
from paperless.models import ArchiveFileChoices
from paperless.models import CleanChoices
from paperless.models import ModeChoices
from paperless.models import OutputTypeChoices
from paperless.parsers.utils import PDF_TEXT_MIN_LENGTH
from paperless.parsers.utils import extract_pdf_text
from paperless.parsers.utils import is_tagged_pdf
from paperless.parsers.utils import read_file_handle_unicode_errors
from paperless.version import __full_version_str__
@@ -33,7 +40,11 @@ if TYPE_CHECKING:
logger = logging.getLogger("paperless.parsing.tesseract")
_SUPPORTED_MIME_TYPES: dict[str, str] = {
_SRGB_ICC_DATA: Final[bytes] = (
importlib.resources.files("ocrmypdf.data").joinpath("sRGB.icc").read_bytes()
)
_SUPPORTED_MIME_TYPES: Final[dict[str, str]] = {
"application/pdf": ".pdf",
"image/jpeg": ".jpg",
"image/png": ".png",
@@ -99,7 +110,7 @@ class RasterisedDocumentParser:
# Lifecycle
# ------------------------------------------------------------------
def __init__(self, logging_group: object = None) -> None:
def __init__(self, logging_group: object | None = None) -> None:
settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
self.tempdir = Path(
tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR),
@@ -233,7 +244,7 @@ class RasterisedDocumentParser:
if (
sidecar_file is not None
and sidecar_file.is_file()
and self.settings.mode != "redo"
and self.settings.mode != ModeChoices.REDO
):
text = read_file_handle_unicode_errors(sidecar_file)
@@ -250,36 +261,7 @@ class RasterisedDocumentParser:
if not Path(pdf_file).is_file():
return None
try:
text = None
with tempfile.NamedTemporaryFile(
mode="w+",
dir=self.tempdir,
) as tmp:
run_subprocess(
[
"pdftotext",
"-q",
"-layout",
"-enc",
"UTF-8",
str(pdf_file),
tmp.name,
],
logger=self.log,
)
text = read_file_handle_unicode_errors(Path(tmp.name))
return post_process_text(text)
except Exception:
# If pdftotext fails, fall back to OCR.
self.log.warning(
"Error while getting text from PDF document with pdftotext",
exc_info=True,
)
# probably not a PDF file.
return None
return post_process_text(extract_pdf_text(Path(pdf_file), log=self.log))
def construct_ocrmypdf_parameters(
self,
@@ -289,6 +271,7 @@ class RasterisedDocumentParser:
sidecar_file: Path,
*,
safe_fallback: bool = False,
skip_text: bool = False,
) -> dict[str, Any]:
ocrmypdf_args: dict[str, Any] = {
"input_file_or_options": input_file,
@@ -307,15 +290,14 @@ class RasterisedDocumentParser:
self.settings.color_conversion_strategy
)
if self.settings.mode == ModeChoices.FORCE or safe_fallback:
if safe_fallback or self.settings.mode == ModeChoices.FORCE:
ocrmypdf_args["force_ocr"] = True
elif self.settings.mode in {
ModeChoices.SKIP,
ModeChoices.SKIP_NO_ARCHIVE,
}:
ocrmypdf_args["skip_text"] = True
elif self.settings.mode == ModeChoices.REDO:
ocrmypdf_args["redo_ocr"] = True
elif skip_text or self.settings.mode == ModeChoices.OFF:
ocrmypdf_args["skip_text"] = True
elif self.settings.mode == ModeChoices.AUTO:
pass # no extra flag: normal OCR (text not found case)
else: # pragma: no cover
raise ParseError(f"Invalid ocr mode: {self.settings.mode}")
@@ -400,6 +382,115 @@ class RasterisedDocumentParser:
return ocrmypdf_args
def _convert_image_to_pdfa(self, document_path: Path) -> Path:
"""Convert an image to a PDF/A-2b file without invoking the OCR engine.
Uses img2pdf for the initial image->PDF wrapping, then pikepdf to stamp
PDF/A-2b conformance metadata.
No Tesseract and no Ghostscript are invoked.
"""
import img2pdf
import pikepdf
plain_pdf_path = Path(self.tempdir) / "image_plain.pdf"
try:
convert_kwargs: dict = {}
if self.settings.image_dpi is not None:
convert_kwargs["layout_fun"] = img2pdf.get_fixed_dpi_layout_fun(
(self.settings.image_dpi, self.settings.image_dpi),
)
plain_pdf_path.write_bytes(
img2pdf.convert(str(document_path), **convert_kwargs),
)
except Exception as e:
raise ParseError(
f"img2pdf conversion failed for {document_path}: {e!s}",
) from e
pdfa_path = Path(self.tempdir) / "archive.pdf"
try:
with pikepdf.open(plain_pdf_path) as pdf:
cs = pdf.make_stream(_SRGB_ICC_DATA)
cs["/N"] = 3
output_intent = pikepdf.Dictionary(
Type=pikepdf.Name("/OutputIntent"),
S=pikepdf.Name("/GTS_PDFA1"),
OutputConditionIdentifier=pikepdf.String("sRGB"),
DestOutputProfile=cs,
)
pdf.Root["/OutputIntents"] = pdf.make_indirect(
pikepdf.Array([output_intent]),
)
meta = pdf.open_metadata(set_pikepdf_as_editor=False)
meta["pdfaid:part"] = "2"
meta["pdfaid:conformance"] = "B"
pdf.save(pdfa_path)
except Exception as e:
self.log.warning(
f"PDF/A metadata stamping failed ({e!s}); falling back to plain PDF.",
)
pdfa_path.write_bytes(plain_pdf_path.read_bytes())
return pdfa_path
def _convert_pdf_to_pdfa(
self,
input_path: Path,
output_path: Path,
) -> None:
"""Convert a PDF to PDF/A using Ghostscript directly, without OCR.
Respects the user's output_type, color_conversion_strategy, and
continue_on_soft_render_error settings.
"""
from ocrmypdf._exec.ghostscript import generate_pdfa
from ocrmypdf.pdfa import generate_pdfa_ps
output_type = self.settings.output_type
if output_type == OutputTypeChoices.PDF:
# No PDF/A requested — just copy the original
copy_file_with_basic_stats(input_path, output_path)
return
# Map output_type to pdfa_part: pdfa→2, pdfa-1→1, pdfa-2→2, pdfa-3→3
pdfa_part = "2" if output_type == "pdfa" else output_type.split("-")[-1]
pdfmark = Path(self.tempdir) / "pdfa.ps"
generate_pdfa_ps(pdfmark)
color_strategy = self.settings.color_conversion_strategy or "RGB"
self.log.debug(
"Converting PDF to PDF/A-%s via Ghostscript (no OCR): %s",
pdfa_part,
input_path,
)
generate_pdfa(
pdf_pages=[pdfmark, input_path],
output_file=output_path,
compression="auto",
color_conversion_strategy=color_strategy,
pdfa_part=pdfa_part,
)
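The output_type to PDF/A part mapping in the comment above is compact enough to sketch standalone. `pdfa_part_for` is a hypothetical helper for illustration, not a function in the codebase:

```python
def pdfa_part_for(output_type: str) -> str:
    """Map an output-type choice to a PDF/A part number.

    'pdfa' is treated as an alias for PDF/A-2; 'pdfa-1', 'pdfa-2' and
    'pdfa-3' carry the part number after the dash.
    """
    return "2" if output_type == "pdfa" else output_type.split("-")[-1]
```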
def _handle_subprocess_output_error(self, e: Exception) -> NoReturn:
"""Log context for Ghostscript failures and raise ParseError.
Called from the SubprocessOutputError handlers in parse() to avoid
duplicating the Ghostscript hint and re-raise logic.
"""
if "Ghostscript PDF/A rendering" in str(e):
self.log.warning(
"Ghostscript PDF/A rendering failed, consider setting "
"PAPERLESS_OCR_USER_ARGS: "
"'{\"continue_on_soft_render_error\": true}'",
)
raise ParseError(
f"SubprocessOutputError: {e!s}. See logs for more information.",
) from e
def parse(
self,
document_path: Path,
@@ -409,57 +500,107 @@ class RasterisedDocumentParser:
) -> None:
# This forces tesseract to use one core per page.
os.environ["OMP_THREAD_LIMIT"] = "1"
VALID_TEXT_LENGTH = 50
if mime_type == "application/pdf":
text_original = self.extract_text(None, document_path)
original_has_text = (
text_original is not None and len(text_original) > VALID_TEXT_LENGTH
)
else:
text_original = None
original_has_text = False
# If the original has text, and the user doesn't want an archive,
# we're done here
skip_archive_for_text = (
self.settings.mode == ModeChoices.SKIP_NO_ARCHIVE
or self.settings.skip_archive_file
in {
ArchiveFileChoices.WITH_TEXT,
ArchiveFileChoices.ALWAYS,
}
)
if skip_archive_for_text and original_has_text:
self.log.debug("Document has text, skipping OCRmyPDF entirely.")
self.text = text_original
return
# Either no text was in the original or there should be an archive
# file created, so OCR the file and create an archive with any
# text located via OCR
import ocrmypdf
from ocrmypdf import EncryptedPdfError
from ocrmypdf import InputFileError
from ocrmypdf import SubprocessOutputError
from ocrmypdf.exceptions import DigitalSignatureError
from ocrmypdf.exceptions import PriorOcrFoundError
if mime_type == "application/pdf":
text_original = self.extract_text(None, document_path)
original_has_text = is_tagged_pdf(document_path, log=self.log) or (
text_original is not None and len(text_original) > PDF_TEXT_MIN_LENGTH
)
else:
text_original = None
original_has_text = False
self.log.debug(
"Text detection: original_has_text=%s (text_length=%d, mode=%s, produce_archive=%s)",
original_has_text,
len(text_original) if text_original else 0,
self.settings.mode,
produce_archive,
)
# --- OCR_MODE=off: never invoke OCR engine ---
if self.settings.mode == ModeChoices.OFF:
if not produce_archive:
self.log.debug(
"OCR: skipped — OCR_MODE=off, no archive requested;"
" returning pdftotext content only",
)
self.text = text_original or ""
return
if self.is_image(mime_type):
self.log.debug(
"OCR: skipped — OCR_MODE=off, image input;"
" converting to PDF/A without OCR",
)
try:
self.archive_path = self._convert_image_to_pdfa(
document_path,
)
self.text = ""
except Exception as e:
raise ParseError(
f"Image to PDF/A conversion failed: {e!s}",
) from e
return
# PDFs in off mode: PDF/A conversion via Ghostscript, no OCR
archive_path = Path(self.tempdir) / "archive.pdf"
try:
self._convert_pdf_to_pdfa(document_path, archive_path)
self.archive_path = archive_path
self.text = text_original or ""
except SubprocessOutputError as e:
self._handle_subprocess_output_error(e)
except Exception as e:
raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
return
# --- OCR_MODE=auto: skip ocrmypdf entirely if text exists and no archive needed ---
if (
self.settings.mode == ModeChoices.AUTO
and original_has_text
and not produce_archive
):
self.log.debug(
"Document has text and no archive requested; skipping OCRmyPDF entirely.",
)
self.text = text_original
return
# --- All other paths: run ocrmypdf ---
archive_path = Path(self.tempdir) / "archive.pdf"
sidecar_file = Path(self.tempdir) / "sidecar.txt"
# auto mode with existing text: PDF/A conversion only (no OCR).
skip_text = self.settings.mode == ModeChoices.AUTO and original_has_text
if skip_text:
self.log.debug(
"OCR strategy: PDF/A conversion only (skip_text)"
" — OCR_MODE=auto, document already has text",
)
else:
self.log.debug("OCR strategy: full OCR — OCR_MODE=%s", self.settings.mode)
args = self.construct_ocrmypdf_parameters(
document_path,
mime_type,
archive_path,
sidecar_file,
skip_text=skip_text,
)
try:
self.log.debug(f"Calling OCRmyPDF with args: {args}")
ocrmypdf.ocr(**args)
if self.settings.skip_archive_file != ArchiveFileChoices.ALWAYS:
if produce_archive:
self.archive_path = archive_path
self.text = self.extract_text(sidecar_file, archive_path)
@@ -474,16 +615,8 @@ class RasterisedDocumentParser:
if original_has_text:
self.text = text_original
except SubprocessOutputError as e:
if "Ghostscript PDF/A rendering" in str(e):
self.log.warning(
"Ghostscript PDF/A rendering failed, consider setting "
"PAPERLESS_OCR_USER_ARGS: '{\"continue_on_soft_render_error\": true}'",
)
raise ParseError(
f"SubprocessOutputError: {e!s}. See logs for more information.",
) from e
except (NoTextFoundException, InputFileError) as e:
self._handle_subprocess_output_error(e)
except (NoTextFoundException, InputFileError, PriorOcrFoundError) as e:
self.log.warning(
f"Encountered an error while running OCR: {e!s}. "
f"Attempting force OCR to get the text.",
@@ -492,8 +625,6 @@ class RasterisedDocumentParser:
archive_path_fallback = Path(self.tempdir) / "archive-fallback.pdf"
sidecar_file_fallback = Path(self.tempdir) / "sidecar-fallback.txt"
# Attempt to run OCR with safe settings.
args = self.construct_ocrmypdf_parameters(
document_path,
mime_type,
@@ -505,25 +636,18 @@ class RasterisedDocumentParser:
try:
self.log.debug(f"Fallback: Calling OCRmyPDF with args: {args}")
ocrmypdf.ocr(**args)
# Don't return the archived file here, since this file
# is bigger and blurry due to --force-ocr.
self.text = self.extract_text(
sidecar_file_fallback,
archive_path_fallback,
)
if produce_archive:
self.archive_path = archive_path_fallback
except Exception as e:
# If this fails, we have a serious issue at hand.
raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
except Exception as e:
# Anything else is probably serious.
raise ParseError(f"{e.__class__.__name__}: {e!s}") from e
# As a last resort, if we still don't have any text for any reason,
# try to extract the text from the original document.
if not self.text:
if original_has_text:
self.text = text_original


@@ -10,15 +10,105 @@ from __future__ import annotations
import logging
import re
import tempfile
from pathlib import Path
from typing import TYPE_CHECKING
from typing import Final
if TYPE_CHECKING:
from pathlib import Path
from paperless.parsers import MetadataEntry
logger = logging.getLogger("paperless.parsers.utils")
# Minimum character count for a PDF to be considered "born-digital" (has real text).
# Used by both the consumer (archive decision) and the tesseract parser (skip-OCR decision).
PDF_TEXT_MIN_LENGTH: Final[int] = 50
def is_tagged_pdf(
path: Path,
log: logging.Logger | None = None,
) -> bool:
"""Return True if the PDF declares itself as tagged (born-digital indicator).
Tagged PDFs (e.g. exported from Word or LibreOffice) have ``/MarkInfo``
with ``/Marked true`` in the document root. This is a reliable signal
that the document has a logical structure and embedded text — running OCR
on it is unnecessary and archive generation can be skipped.
https://github.com/ocrmypdf/OCRmyPDF/blob/4e974ebd465a5921b2e79004f098f5d203010282/src/ocrmypdf/pdfinfo/info.py#L449
Parameters
----------
path:
Absolute path to the PDF file.
log:
Logger for warnings. Falls back to the module-level logger when omitted.
Returns
-------
bool
``True`` when the PDF is tagged, ``False`` otherwise or on any error.
"""
import pikepdf
_log = log or logger
try:
with pikepdf.open(path) as pdf:
mark_info = pdf.Root.get("/MarkInfo")
if mark_info is None:
return False
return bool(mark_info.get("/Marked", False))
except Exception:
_log.warning("Could not check PDF tag status for %s", path, exc_info=True)
return False
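The `/MarkInfo` check above requires a real PDF parser (pikepdf). A dependency-free byte scan approximates the same signal and shows what the marker looks like on disk; this is a teaching sketch only, since raw scanning can false-positive on string or stream content, and real detection should parse the document root as `is_tagged_pdf` does:

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    # Crude approximation of is_tagged_pdf: tagged PDFs carry a /MarkInfo
    # dictionary with /Marked true in the document catalog.
    return b"/MarkInfo" in pdf_bytes and b"/Marked true" in pdf_bytes
```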
def extract_pdf_text(
path: Path,
log: logging.Logger | None = None,
) -> str | None:
"""Run pdftotext on *path* and return the extracted text, or None on failure.
Parameters
----------
path:
Absolute path to the PDF file.
log:
Logger for warnings. Falls back to the module-level logger when omitted.
Returns
-------
str | None
Extracted text, or ``None`` if pdftotext fails or the file is not a PDF.
"""
from documents.utils import run_subprocess
_log = log or logger
try:
with tempfile.TemporaryDirectory() as tmpdir:
out_path = Path(tmpdir) / "text.txt"
run_subprocess(
[
"pdftotext",
"-q",
"-layout",
"-enc",
"UTF-8",
str(path),
str(out_path),
],
logger=_log,
)
text = read_file_handle_unicode_errors(out_path, log=_log)
return text or None
except Exception:
_log.warning(
"Error while getting text from PDF document with pdftotext",
exc_info=True,
)
return None
def read_file_handle_unicode_errors(
filepath: Path,


@@ -889,10 +889,23 @@ OCR_LANGUAGE = os.getenv("PAPERLESS_OCR_LANGUAGE", "eng")
# OCRmyPDF --output-type options are available.
OCR_OUTPUT_TYPE = os.getenv("PAPERLESS_OCR_OUTPUT_TYPE", "pdfa")
# skip, redo, force
OCR_MODE = os.getenv("PAPERLESS_OCR_MODE", "skip")
if os.environ.get("PAPERLESS_OCR_MODE", "") in (
"skip",
"skip_noarchive",
): # pragma: no cover
OCR_MODE = "auto"
else:
OCR_MODE = get_choice_from_env(
"PAPERLESS_OCR_MODE",
{"auto", "force", "redo", "off"},
default="auto",
)
OCR_SKIP_ARCHIVE_FILE = os.getenv("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", "never")
ARCHIVE_FILE_GENERATION = get_choice_from_env(
"PAPERLESS_ARCHIVE_FILE_GENERATION",
{"auto", "always", "never"},
default="auto",
)
OCR_IMAGE_DPI = get_int_from_env("PAPERLESS_OCR_IMAGE_DPI")
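`get_choice_from_env` is a paperless settings helper; a plausible re-implementation (the error message and exact signature are assumptions) shows the validation pattern the settings block above relies on:

```python
import os

def get_choice_from_env(name: str, choices: set[str], default: str) -> str:
    # Hypothetical sketch of the helper used above: read the variable,
    # fall back to the default, and reject values outside the choice set.
    value = os.getenv(name, default)
    if value not in choices:
        raise ValueError(f"{name} must be one of {sorted(choices)}, got {value!r}")
    return value
```

With `PAPERLESS_OCR_MODE` unset, the call above yields the `"auto"` default; an unexpected value such as `"skip"` raises instead of silently passing through.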


@@ -708,7 +708,7 @@ def null_app_config(mocker: MockerFixture) -> MagicMock:
pages=None,
language=None,
mode=None,
skip_archive_file=None,
archive_file_generation=None,
image_dpi=None,
unpaper_clean=None,
deskew=None,


@@ -0,0 +1,141 @@
"""
Tests for RasterisedDocumentParser._convert_image_to_pdfa.
The method converts an image to a PDF/A-2b file using img2pdf (wrapping)
then pikepdf (PDF/A metadata stamping), with a fallback to plain PDF when
pikepdf stamping fails. No Tesseract or Ghostscript is invoked.
These are unit/integration tests: img2pdf and pikepdf run for real; only
error-path branches mock the respective library call.
"""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
import img2pdf
import magic
import pikepdf
import pytest
from documents.parsers import ParseError
if TYPE_CHECKING:
from pytest_mock import MockerFixture
from paperless.parsers.tesseract import RasterisedDocumentParser
class TestConvertImageToPdfa:
"""_convert_image_to_pdfa: output shape, error paths, DPI handling."""
def test_valid_png_produces_pdf_bytes(
self,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN: a valid PNG with DPI metadata
WHEN: _convert_image_to_pdfa is called
THEN: the returned file is non-empty and begins with the PDF magic bytes
"""
result = tesseract_parser._convert_image_to_pdfa(simple_png_file)
assert result.exists()
assert magic.from_file(str(result), mime=True) == "application/pdf"
def test_output_path_is_archive_pdf_in_tempdir(
self,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN: any valid image
WHEN: _convert_image_to_pdfa is called
THEN: the returned path is exactly <tempdir>/archive.pdf
"""
result = tesseract_parser._convert_image_to_pdfa(simple_png_file)
assert result == Path(tesseract_parser.tempdir) / "archive.pdf"
def test_img2pdf_failure_raises_parse_error(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN: img2pdf.convert raises an exception
WHEN: _convert_image_to_pdfa is called
THEN: a ParseError is raised that mentions "img2pdf conversion failed"
"""
mocker.patch.object(img2pdf, "convert", side_effect=Exception("boom"))
with pytest.raises(ParseError, match="img2pdf conversion failed"):
tesseract_parser._convert_image_to_pdfa(simple_png_file)
def test_pikepdf_stamping_failure_falls_back_to_plain_pdf(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN: pikepdf.open raises during PDF/A metadata stamping
WHEN: _convert_image_to_pdfa is called
THEN: no exception is raised and the returned file is still a valid PDF
(plain PDF bytes are used as fallback)
"""
mocker.patch.object(pikepdf, "open", side_effect=Exception("pikepdf boom"))
result = tesseract_parser._convert_image_to_pdfa(simple_png_file)
assert result.exists()
assert magic.from_file(str(result), mime=True) == "application/pdf"
def test_image_dpi_setting_applies_fixed_dpi_layout(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_no_dpi_png_file: Path,
) -> None:
"""
GIVEN: parser.settings.image_dpi = 150
WHEN: _convert_image_to_pdfa is called with a no-DPI PNG
THEN: img2pdf.get_fixed_dpi_layout_fun is called with (150, 150)
and the output is still a valid PDF
"""
spy = mocker.patch.object(
img2pdf,
"get_fixed_dpi_layout_fun",
wraps=img2pdf.get_fixed_dpi_layout_fun,
)
tesseract_parser.settings.image_dpi = 150
result = tesseract_parser._convert_image_to_pdfa(simple_no_dpi_png_file)
spy.assert_called_once_with((150, 150))
assert magic.from_file(str(result), mime=True) == "application/pdf"
def test_no_image_dpi_setting_skips_fixed_dpi_layout(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN: parser.settings.image_dpi is None (default)
WHEN: _convert_image_to_pdfa is called
THEN: img2pdf.get_fixed_dpi_layout_fun is never called
"""
spy = mocker.patch.object(
img2pdf,
"get_fixed_dpi_layout_fun",
wraps=img2pdf.get_fixed_dpi_layout_fun,
)
tesseract_parser.settings.image_dpi = None
tesseract_parser._convert_image_to_pdfa(simple_png_file)
spy.assert_not_called()


@@ -0,0 +1,440 @@
"""
Focused tests for RasterisedDocumentParser.parse() mode behaviour.
These tests mock ``ocrmypdf.ocr`` so they run without a real Tesseract/OCRmyPDF
installation and execute quickly. The intent is to verify the *control flow*
introduced by the ``produce_archive`` flag and the ``OCR_MODE=auto/off`` logic,
not to test OCRmyPDF itself.
Fixtures are pulled from conftest.py in this package.
"""
from __future__ import annotations
from pathlib import Path
from typing import TYPE_CHECKING
import pytest
if TYPE_CHECKING:
from pytest_mock import MockerFixture
from paperless.parsers.tesseract import RasterisedDocumentParser
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
_LONG_TEXT = "This is a test document with enough text. " * 5 # >50 chars
_SHORT_TEXT = "Hi." # <50 chars
def _make_extract_text(text: str | None):
"""Return a side_effect function for ``extract_text`` that returns *text*."""
def _extract(sidecar_file, pdf_file):
return text
return _extract
# ---------------------------------------------------------------------------
# AUTO mode — PDF with sufficient text layer
# ---------------------------------------------------------------------------
class TestAutoModeWithText:
"""AUTO mode, original PDF has detectable text (>50 chars)."""
def test_auto_text_no_archive_skips_ocrmypdf(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_digital_pdf_file: Path,
) -> None:
"""
GIVEN:
- AUTO mode, produce_archive=False
- PDF with text > VALID_TEXT_LENGTH
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr is NOT called (early return path)
- archive_path remains None
- text is set from the original
"""
# Patch extract_text to return long text (simulating detectable text layer)
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
simple_digital_pdf_file,
"application/pdf",
produce_archive=False,
)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path is None
assert tesseract_parser.get_text() == _LONG_TEXT
def test_auto_text_with_archive_calls_ocrmypdf_skip_text(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_digital_pdf_file: Path,
) -> None:
"""
GIVEN:
- AUTO mode, produce_archive=True
- PDF with text > VALID_TEXT_LENGTH
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr IS called with skip_text=True
- archive_path is set
"""
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
simple_digital_pdf_file,
"application/pdf",
produce_archive=True,
)
mock_ocr.assert_called_once()
call_kwargs = mock_ocr.call_args.kwargs
assert call_kwargs.get("skip_text") is True
assert "force_ocr" not in call_kwargs
assert "redo_ocr" not in call_kwargs
assert tesseract_parser.archive_path is not None
# ---------------------------------------------------------------------------
# AUTO mode — PDF without text layer (or too short)
# ---------------------------------------------------------------------------
class TestAutoModeNoText:
"""AUTO mode, original PDF has no detectable text (<= 50 chars)."""
def test_auto_no_text_with_archive_calls_ocrmypdf_no_extra_flag(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
multi_page_images_pdf_file: Path,
) -> None:
"""
GIVEN:
- AUTO mode, produce_archive=True
- PDF with no text (or text <= VALID_TEXT_LENGTH)
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr IS called WITHOUT skip_text/force_ocr/redo_ocr
- archive_path is set (since produce_archive=True)
"""
# Return "no text" for the original; return real text for archive
extract_call_count = 0
def _extract_side(sidecar_file, pdf_file):
nonlocal extract_call_count
extract_call_count += 1
if extract_call_count == 1:
return None # original has no text
return _LONG_TEXT # text from archive after OCR
mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
multi_page_images_pdf_file,
"application/pdf",
produce_archive=True,
)
mock_ocr.assert_called_once()
call_kwargs = mock_ocr.call_args.kwargs
assert "skip_text" not in call_kwargs
assert "force_ocr" not in call_kwargs
assert "redo_ocr" not in call_kwargs
assert tesseract_parser.archive_path is not None
def test_auto_no_text_no_archive_calls_ocrmypdf(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
multi_page_images_pdf_file: Path,
) -> None:
"""
GIVEN:
- AUTO mode, produce_archive=False
- PDF with no text
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr IS called (no early return since no text detected)
- archive_path is NOT set (produce_archive=False)
"""
extract_call_count = 0
def _extract_side(sidecar_file, pdf_file):
nonlocal extract_call_count
extract_call_count += 1
if extract_call_count == 1:
return None
return _LONG_TEXT
mocker.patch.object(tesseract_parser, "extract_text", side_effect=_extract_side)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
multi_page_images_pdf_file,
"application/pdf",
produce_archive=False,
)
mock_ocr.assert_called_once()
assert tesseract_parser.archive_path is None
# ---------------------------------------------------------------------------
# OFF mode — PDF
# ---------------------------------------------------------------------------
class TestOffModePdf:
"""OCR_MODE=off, document is a PDF."""
def test_off_no_archive_returns_pdftotext(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_digital_pdf_file: Path,
) -> None:
"""
GIVEN:
- OFF mode, produce_archive=False
- PDF with text
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr is NOT called
- archive_path is None
- text comes from pdftotext (extract_text)
"""
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "off"
tesseract_parser.parse(
simple_digital_pdf_file,
"application/pdf",
produce_archive=False,
)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path is None
assert tesseract_parser.get_text() == _LONG_TEXT
def test_off_with_archive_uses_ghostscript_not_ocr(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_digital_pdf_file: Path,
) -> None:
"""
GIVEN:
- OFF mode, produce_archive=True
- PDF document
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr is NOT called
- Ghostscript generate_pdfa IS called (PDF/A conversion without OCR)
- archive_path is set
- text comes from pdftotext, not OCR
"""
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
mock_gs = mocker.patch(
"ocrmypdf._exec.ghostscript.generate_pdfa",
)
mocker.patch("ocrmypdf.pdfa.generate_pdfa_ps")
tesseract_parser.settings.mode = "off"
tesseract_parser.parse(
simple_digital_pdf_file,
"application/pdf",
produce_archive=True,
)
mock_ocr.assert_not_called()
mock_gs.assert_called_once()
assert tesseract_parser.archive_path is not None
assert tesseract_parser.get_text() == _LONG_TEXT
# ---------------------------------------------------------------------------
# OFF mode — image
# ---------------------------------------------------------------------------
class TestOffModeImage:
"""OCR_MODE=off, document is an image (PNG)."""
def test_off_image_no_archive_no_ocrmypdf(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN:
- OFF mode, produce_archive=False
- Image document (PNG)
WHEN:
- parse() is called
THEN:
- ocrmypdf.ocr is NOT called
- archive_path is None
- text is empty string (images have no text layer)
"""
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "off"
tesseract_parser.parse(simple_png_file, "image/png", produce_archive=False)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path is None
assert tesseract_parser.get_text() == ""
def test_off_image_with_archive_uses_img2pdf_path(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_png_file: Path,
) -> None:
"""
GIVEN:
- OFF mode, produce_archive=True
- Image document (PNG)
WHEN:
- parse() is called
THEN:
- _convert_image_to_pdfa() is called instead of ocrmypdf.ocr
- archive_path is set to the returned path
- text is empty string
"""
fake_archive = Path("/tmp/fake-archive.pdf")
mock_convert = mocker.patch.object(
tesseract_parser,
"_convert_image_to_pdfa",
return_value=fake_archive,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "off"
tesseract_parser.parse(simple_png_file, "image/png", produce_archive=True)
mock_convert.assert_called_once_with(simple_png_file)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path == fake_archive
assert tesseract_parser.get_text() == ""
# ---------------------------------------------------------------------------
# produce_archive=False never sets archive_path for FORCE / REDO / AUTO modes
# ---------------------------------------------------------------------------
class TestProduceArchiveFalse:
"""Verify produce_archive=False never results in an archive regardless of mode."""
@pytest.mark.parametrize("mode", ["force", "redo"])
def test_produce_archive_false_force_redo_modes(
self,
mode: str,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
multi_page_images_pdf_file: Path,
) -> None:
"""
GIVEN:
- FORCE or REDO mode, produce_archive=False
- Any PDF
WHEN:
- parse() is called (ocrmypdf mocked to succeed)
THEN:
- archive_path is NOT set even though ocrmypdf ran
"""
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = mode
tesseract_parser.parse(
multi_page_images_pdf_file,
"application/pdf",
produce_archive=False,
)
assert tesseract_parser.archive_path is None
assert tesseract_parser.get_text() is not None
def test_produce_archive_false_auto_with_text(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
simple_digital_pdf_file: Path,
) -> None:
"""
GIVEN:
- AUTO mode, produce_archive=False
- PDF with text > VALID_TEXT_LENGTH
WHEN:
- parse() is called
THEN:
- ocrmypdf is skipped entirely (early return)
- archive_path is None
"""
mocker.patch.object(
tesseract_parser,
"extract_text",
return_value=_LONG_TEXT,
)
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
simple_digital_pdf_file,
"application/pdf",
produce_archive=False,
)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path is None


@@ -94,15 +94,35 @@ class TestParserSettingsFromDb(DirectoriesMixin, FileSystemAssertsMixin, TestCas
WHEN:
- OCR parameters are constructed
THEN:
- Configuration from database is utilized
- Configuration from database is utilized (AUTO mode with skip_text=True
triggers skip_text; AUTO mode alone does not add any extra flag)
"""
# AUTO mode with skip_text=True explicitly passed: skip_text is set
with override_settings(OCR_MODE="redo"):
instance = ApplicationConfiguration.objects.all().first()
instance.mode = ModeChoices.SKIP
instance.mode = ModeChoices.AUTO
instance.save()
params = RasterisedDocumentParser(None).construct_ocrmypdf_parameters(
input_file="input.pdf",
output_file="output.pdf",
sidecar_file="sidecar.txt",
mime_type="application/pdf",
safe_fallback=False,
skip_text=True,
)
self.assertTrue(params["skip_text"])
self.assertNotIn("redo_ocr", params)
self.assertNotIn("force_ocr", params)
# AUTO mode alone (no skip_text): no extra OCR flag is set
with override_settings(OCR_MODE="redo"):
instance = ApplicationConfiguration.objects.all().first()
instance.mode = ModeChoices.AUTO
instance.save()
params = self.get_params()
self.assertTrue(params["skip_text"])
self.assertNotIn("skip_text", params)
self.assertNotIn("redo_ocr", params)
self.assertNotIn("force_ocr", params)


@@ -370,15 +370,26 @@ class TestParsePdf:
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
"""
GIVEN:
- Multi-page digital PDF with sufficient text layer
- Default settings (mode=auto, produce_archive=True)
WHEN:
- Document is parsed
THEN:
- Archive is created (AUTO mode + text present + produce_archive=True
→ PDF/A conversion via skip_text)
- Text is extracted
"""
tesseract_parser.parse(
tesseract_samples_dir / "simple-digital.pdf",
tesseract_samples_dir / "multi-page-digital.pdf",
"application/pdf",
)
assert tesseract_parser.archive_path is not None
assert tesseract_parser.archive_path.is_file()
assert_ordered_substrings(
tesseract_parser.get_text(),
["This is a test document."],
tesseract_parser.get_text().lower(),
["page 1", "page 2", "page 3"],
)
def test_with_form_default(
@@ -397,7 +408,7 @@ class TestParsePdf:
["Please enter your name in here:", "This is a PDF document with a form."],
)
def test_with_form_redo_produces_no_archive(
def test_with_form_redo_no_archive_when_not_requested(
self,
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
@@ -406,6 +417,7 @@ class TestParsePdf:
tesseract_parser.parse(
tesseract_samples_dir / "with-form.pdf",
"application/pdf",
produce_archive=False,
)
assert tesseract_parser.archive_path is None
assert_ordered_substrings(
@@ -433,7 +445,7 @@ class TestParsePdf:
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
tesseract_parser.settings.mode = "skip"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(tesseract_samples_dir / "signed.pdf", "application/pdf")
assert tesseract_parser.archive_path is None
assert_ordered_substrings(
@@ -449,7 +461,7 @@ class TestParsePdf:
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
tesseract_parser.settings.mode = "skip"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "encrypted.pdf",
"application/pdf",
@@ -559,7 +571,7 @@ class TestParseMultiPage:
@pytest.mark.parametrize(
"mode",
[
pytest.param("skip", id="skip"),
pytest.param("auto", id="auto"),
pytest.param("redo", id="redo"),
pytest.param("force", id="force"),
],
@@ -587,7 +599,7 @@ class TestParseMultiPage:
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
tesseract_parser.settings.mode = "skip"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "multi-page-images.pdf",
"application/pdf",
@@ -735,16 +747,18 @@ class TestSkipArchive:
"""
GIVEN:
- File with existing text layer
- Mode: skip_noarchive
- Mode: auto, produce_archive=False
WHEN:
- Document is parsed
THEN:
- Text extracted; no archive created
- Text extracted from original; no archive created (text exists +
produce_archive=False skips OCRmyPDF entirely)
"""
tesseract_parser.settings.mode = "skip_noarchive"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "multi-page-digital.pdf",
"application/pdf",
produce_archive=False,
)
assert tesseract_parser.archive_path is None
assert_ordered_substrings(
@@ -760,13 +774,13 @@ class TestSkipArchive:
"""
GIVEN:
- File with image-only pages (no text layer)
- Mode: skip_noarchive
- Mode: auto, skip_archive_file: auto
WHEN:
- Document is parsed
THEN:
- Text extracted; archive created (OCR needed)
- Text extracted; archive created (OCR needed, no existing text)
"""
tesseract_parser.settings.mode = "skip_noarchive"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "multi-page-images.pdf",
"application/pdf",
@@ -778,41 +792,58 @@ class TestSkipArchive:
)
@pytest.mark.parametrize(
("skip_archive_file", "filename", "expect_archive"),
("produce_archive", "filename", "expect_archive"),
[
pytest.param("never", "multi-page-digital.pdf", True, id="never-with-text"),
pytest.param("never", "multi-page-images.pdf", True, id="never-no-text"),
pytest.param(
"with_text",
True,
"multi-page-digital.pdf",
False,
id="with-text-layer",
True,
id="produce-archive-with-text",
),
pytest.param(
"with_text",
True,
"multi-page-images.pdf",
True,
id="with-text-no-layer",
id="produce-archive-no-text",
),
pytest.param(
"always",
False,
"multi-page-digital.pdf",
False,
id="always-with-text",
id="no-archive-with-text-layer",
),
pytest.param(
False,
"multi-page-images.pdf",
False,
id="no-archive-no-text-layer",
),
pytest.param("always", "multi-page-images.pdf", False, id="always-no-text"),
],
)
def test_skip_archive_file_setting(
def test_produce_archive_flag(
self,
skip_archive_file: str,
produce_archive: bool, # noqa: FBT001
filename: str,
expect_archive: str,
expect_archive: bool, # noqa: FBT001
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
tesseract_parser.settings.skip_archive_file = skip_archive_file
tesseract_parser.parse(tesseract_samples_dir / filename, "application/pdf")
"""
GIVEN:
- Various PDFs (with and without text layers)
- produce_archive flag set to True or False
WHEN:
- Document is parsed
THEN:
- archive_path is set if and only if produce_archive=True
- Text is always extracted
"""
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / filename,
"application/pdf",
produce_archive=produce_archive,
)
text = tesseract_parser.get_text().lower()
assert_ordered_substrings(text, ["page 1", "page 2", "page 3"])
if expect_archive:
@@ -820,6 +851,59 @@ class TestSkipArchive:
else:
assert tesseract_parser.archive_path is None
def test_tagged_pdf_skips_ocr_in_auto_mode(
self,
mocker: MockerFixture,
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
"""
GIVEN:
- A tagged PDF (e.g. exported from Word, /MarkInfo /Marked true)
- Mode: auto, produce_archive=False
WHEN:
- Document is parsed
THEN:
- OCRmyPDF is not invoked (tagged ⇒ original_has_text=True)
- Text is extracted from the original via pdftotext
- No archive is produced
"""
tesseract_parser.settings.mode = "auto"
mock_ocr = mocker.patch("ocrmypdf.ocr")
tesseract_parser.parse(
tesseract_samples_dir / "simple-digital.pdf",
"application/pdf",
produce_archive=False,
)
mock_ocr.assert_not_called()
assert tesseract_parser.archive_path is None
assert tesseract_parser.get_text()
def test_tagged_pdf_produces_pdfa_archive_without_ocr(
self,
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
"""
GIVEN:
- A tagged PDF (e.g. exported from Word, /MarkInfo /Marked true)
- Mode: auto, produce_archive=True
WHEN:
- Document is parsed
THEN:
- OCRmyPDF runs with skip_text (PDF/A conversion only, no OCR)
- Archive is produced
- Text is preserved from the original
"""
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "simple-digital.pdf",
"application/pdf",
produce_archive=True,
)
assert tesseract_parser.archive_path is not None
assert tesseract_parser.get_text()
# ---------------------------------------------------------------------------
# Parse — mixed pages / sidecar
@@ -835,13 +919,13 @@ class TestParseMixed:
"""
GIVEN:
- File with text in some pages (image) and some pages (digital)
- Mode: skip
- Mode: auto (skip_text), skip_archive_file: always
WHEN:
- Document is parsed
THEN:
- All pages extracted; archive created; sidecar notes skipped pages
"""
tesseract_parser.settings.mode = "skip"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "multi-page-mixed.pdf",
"application/pdf",
@@ -898,17 +982,18 @@ class TestParseMixed:
) -> None:
"""
GIVEN:
- File with mixed pages
- Mode: skip_noarchive
- File with mixed pages (some with text, some image-only)
- Mode: auto, produce_archive=False
WHEN:
- Document is parsed
THEN:
- No archive created (file has text layer); later-page text present
- No archive created (produce_archive=False); text from text layer present
"""
tesseract_parser.settings.mode = "skip_noarchive"
tesseract_parser.settings.mode = "auto"
tesseract_parser.parse(
tesseract_samples_dir / "multi-page-mixed.pdf",
"application/pdf",
produce_archive=False,
)
assert tesseract_parser.archive_path is None
assert_ordered_substrings(
@@ -923,12 +1008,12 @@ class TestParseMixed:
class TestParseRotate:
def test_rotate_skip_mode(
def test_rotate_auto_mode(
self,
tesseract_parser: RasterisedDocumentParser,
tesseract_samples_dir: Path,
) -> None:
tesseract_parser.settings.mode = "skip"
tesseract_parser.settings.mode = "auto"
tesseract_parser.settings.rotate = True
tesseract_parser.parse(tesseract_samples_dir / "rotated.pdf", "application/pdf")
assert_ordered_substrings(
@@ -955,12 +1040,19 @@ class TestParseRtl:
) -> None:
"""
GIVEN:
- PDF with RTL Arabic text
- PDF with RTL Arabic text in its text layer (short: 18 chars)
- mode=off, produce_archive=True: PDF/A conversion via skip_text, no OCR engine
WHEN:
- Document is parsed
THEN:
- Arabic content is extracted (normalised for bidi)
- Arabic content is extracted from the PDF text layer (normalised for bidi)
Note: The RTL PDF has a short text layer (< VALID_TEXT_LENGTH=50) so AUTO mode
would attempt full OCR, which fails due to PriorOcrFoundError and falls back to
force-ocr with English Tesseract (producing garbage). Using mode="off" forces
skip_text=True so the Arabic text layer is preserved through PDF/A conversion.
"""
tesseract_parser.settings.mode = "off"
tesseract_parser.parse(
tesseract_samples_dir / "rtl-test.pdf",
"application/pdf",
@@ -971,7 +1063,8 @@ class TestParseRtl:
if unicodedata.category(ch) != "Cf" and not ch.isspace()
)
assert "ةرازو" in normalised
assert any(token in normalised for token in ("ةیلخادلا", "الاخليد"))
# pdftotext uses Arabic Yeh (U+064A) where ocrmypdf used Farsi Yeh (U+06CC)
assert any(token in normalised for token in ("ةیلخادلا", "الاخليد", "ةيلخادال"))
# ---------------------------------------------------------------------------
@@ -1023,11 +1116,11 @@ class TestOcrmypdfParameters:
assert ("clean" in params) == expected_clean
assert ("clean_final" in params) == expected_clean_final
def test_clean_final_skip_mode(
def test_clean_final_auto_mode(
self,
make_tesseract_parser: MakeTesseractParser,
) -> None:
with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="skip") as parser:
with make_tesseract_parser(OCR_CLEAN="clean-final", OCR_MODE="auto") as parser:
params = parser.construct_ocrmypdf_parameters("", "", "", "")
assert params["clean_final"] is True
assert "clean" not in params
@@ -1044,9 +1137,9 @@ class TestOcrmypdfParameters:
@pytest.mark.parametrize(
("ocr_mode", "ocr_deskew", "expect_deskew"),
[
pytest.param("skip", True, True, id="skip-deskew-on"),
pytest.param("auto", True, True, id="auto-deskew-on"),
pytest.param("redo", True, False, id="redo-deskew-off"),
pytest.param("skip", False, False, id="skip-no-deskew"),
pytest.param("auto", False, False, id="auto-no-deskew"),
],
)
def test_deskew_option(

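Taken together, the parser tests above imply a small decision table for when OCRmyPDF runs and in which mode. A minimal sketch of that decision (the function name and return labels are assumptions for illustration, not the parser's actual API):

```python
def decide_ocr_action(mode: str, original_has_text: bool, produce_archive: bool) -> str:
    """Sketch of the OCR decision the tests above exercise (names are assumptions)."""
    if mode == "redo":
        return "redo_ocr"
    if mode == "force":
        return "force_ocr"
    if not produce_archive and (original_has_text or mode == "off"):
        # Text layer already present and no archive wanted: pdftotext only,
        # OCRmyPDF is never invoked.
        return "extract_text_only"
    if original_has_text or mode == "off":
        # Keep the existing text layer; OCRmyPDF only converts to PDF/A (skip_text).
        return "skip_text"
    return "full_ocr"
```

This matches the cases above: a tagged/digital PDF with `produce_archive=False` skips OCRmyPDF entirely, the same PDF with `produce_archive=True` gets a skip_text PDF/A conversion, and an image-only PDF in auto mode gets full OCR.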

@@ -132,13 +132,13 @@ class TestOcrSettingsChecks:
pytest.param(
"OCR_MODE",
"skip_noarchive",
"deprecated",
id="deprecated-mode",
'OCR output mode "skip_noarchive"',
id="deprecated-mode-now-invalid",
),
pytest.param(
"OCR_SKIP_ARCHIVE_FILE",
"ARCHIVE_FILE_GENERATION",
"invalid",
'OCR_SKIP_ARCHIVE_FILE setting "invalid"',
'PAPERLESS_ARCHIVE_FILE_GENERATION setting "invalid"',
id="invalid-skip-archive-file",
),
pytest.param(


@@ -0,0 +1,64 @@
"""Tests for v3 system checks: deprecated v2 OCR env var warnings."""
from __future__ import annotations
import os
from typing import TYPE_CHECKING
import pytest
from paperless.checks import check_deprecated_v2_ocr_env_vars
if TYPE_CHECKING:
from pytest_mock import MockerFixture
class TestDeprecatedV2OcrEnvVarWarnings:
def test_no_deprecated_vars_returns_empty(self, mocker: MockerFixture) -> None:
"""No warnings when neither deprecated variable is set."""
mocker.patch.dict(os.environ, {"PAPERLESS_OCR_MODE": "auto"}, clear=True)
result = check_deprecated_v2_ocr_env_vars(None)
assert result == []
@pytest.mark.parametrize(
("env_var", "env_value", "expected_id", "expected_fragment"),
[
pytest.param(
"PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
"always",
"paperless.W002",
"PAPERLESS_OCR_SKIP_ARCHIVE_FILE",
id="skip-archive-file-warns",
),
pytest.param(
"PAPERLESS_OCR_MODE",
"skip",
"paperless.W003",
"skip",
id="ocr-mode-skip-warns",
),
pytest.param(
"PAPERLESS_OCR_MODE",
"skip_noarchive",
"paperless.W003",
"skip_noarchive",
id="ocr-mode-skip-noarchive-warns",
),
],
)
def test_deprecated_var_produces_one_warning(
self,
mocker: MockerFixture,
env_var: str,
env_value: str,
expected_id: str,
expected_fragment: str,
) -> None:
"""Each deprecated setting in isolation produces exactly one warning."""
mocker.patch.dict(os.environ, {env_var: env_value}, clear=True)
result = check_deprecated_v2_ocr_env_vars(None)
assert len(result) == 1
warning = result[0]
assert warning.id == expected_id
assert expected_fragment in warning.msg
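The behaviour these tests pin down can be sketched as a pure function of the environment. This is an illustrative reconstruction, not the shipped implementation; `CheckWarning` stands in for Django's `django.core.checks.Warning`, and the exact message wording is an assumption:

```python
import os
from dataclasses import dataclass


@dataclass
class CheckWarning:
    """Stand-in for django.core.checks.Warning (msg + id only)."""
    msg: str
    id: str


def check_deprecated_v2_ocr_env_vars(app_configs=None, **kwargs):
    """One warning per deprecated v2 OCR env var found in the environment."""
    warnings = []
    if "PAPERLESS_OCR_SKIP_ARCHIVE_FILE" in os.environ:
        warnings.append(CheckWarning(
            msg="PAPERLESS_OCR_SKIP_ARCHIVE_FILE is deprecated; "
                "use PAPERLESS_ARCHIVE_FILE_GENERATION instead.",
            id="paperless.W002",
        ))
    mode = os.environ.get("PAPERLESS_OCR_MODE")
    if mode in ("skip", "skip_noarchive"):
        warnings.append(CheckWarning(
            msg=f'PAPERLESS_OCR_MODE value "{mode}" is deprecated; use "auto".',
            id="paperless.W003",
        ))
    return warnings
```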


@@ -0,0 +1,89 @@
from documents.tests.utils import TestMigrations
class TestMigrateSkipArchiveFile(TestMigrations):
migrate_from = "0007_optimize_integer_field_sizes"
migrate_to = "0008_replace_skip_archive_file"
def setUpBeforeMigration(self, apps):
ApplicationConfiguration = apps.get_model(
"paperless",
"ApplicationConfiguration",
)
ApplicationConfiguration.objects.all().delete()
ApplicationConfiguration.objects.create(
pk=1,
mode="skip",
skip_archive_file="always",
)
ApplicationConfiguration.objects.create(
pk=2,
mode="redo",
skip_archive_file="with_text",
)
ApplicationConfiguration.objects.create(
pk=3,
mode="force",
skip_archive_file="never",
)
ApplicationConfiguration.objects.create(
pk=4,
mode="skip_noarchive",
skip_archive_file=None,
)
ApplicationConfiguration.objects.create(
pk=5,
mode="skip_noarchive",
skip_archive_file="never",
)
ApplicationConfiguration.objects.create(pk=6, mode=None, skip_archive_file=None)
def _get_config(self, pk):
ApplicationConfiguration = self.apps.get_model(
"paperless",
"ApplicationConfiguration",
)
return ApplicationConfiguration.objects.get(pk=pk)
def test_skip_mapped_to_auto(self):
config = self._get_config(1)
assert config.mode == "auto"
def test_skip_archive_always_mapped_to_never(self):
config = self._get_config(1)
assert config.archive_file_generation == "never"
def test_redo_unchanged(self):
config = self._get_config(2)
assert config.mode == "redo"
def test_skip_archive_with_text_mapped_to_auto(self):
config = self._get_config(2)
assert config.archive_file_generation == "auto"
def test_force_unchanged(self):
config = self._get_config(3)
assert config.mode == "force"
def test_skip_archive_never_mapped_to_always(self):
config = self._get_config(3)
assert config.archive_file_generation == "always"
def test_skip_noarchive_mapped_to_auto(self):
config = self._get_config(4)
assert config.mode == "auto"
def test_skip_noarchive_implies_archive_never(self):
config = self._get_config(4)
assert config.archive_file_generation == "never"
def test_skip_noarchive_explicit_skip_archive_takes_precedence(self):
"""skip_archive_file=never maps to always, not overridden by skip_noarchive."""
config = self._get_config(5)
assert config.mode == "auto"
assert config.archive_file_generation == "always"
def test_null_values_remain_null(self):
config = self._get_config(6)
assert config.mode is None
assert config.archive_file_generation is None
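The mapping these migration tests encode reduces to pure lookup logic. A sketch of the per-row transformation (the real migration operates on `ApplicationConfiguration` rows; the helper name here is an assumption):

```python
# Old v2 modes "skip" and "skip_noarchive" both collapse into "auto";
# "redo", "force" and NULL pass through unchanged.
MODE_MAP = {"skip": "auto", "skip_noarchive": "auto"}

# skip_archive_file is inverted into archive_file_generation:
# "always skip" -> never generate, "skip with_text" -> auto, "never skip" -> always.
SKIP_ARCHIVE_MAP = {"always": "never", "with_text": "auto", "never": "always"}


def migrate_row(mode, skip_archive_file):
    """Return (new mode, archive_file_generation) for one config row."""
    new_mode = MODE_MAP.get(mode, mode)
    if skip_archive_file is not None:
        # An explicit skip_archive_file value wins, even alongside skip_noarchive.
        generation = SKIP_ARCHIVE_MAP[skip_archive_file]
    elif mode == "skip_noarchive":
        # skip_noarchive alone implies "never produce an archive".
        generation = "never"
    else:
        generation = None
    return new_mode, generation
```

Each branch corresponds to one of the fixtures above, including pk=5, where the explicit `skip_archive_file="never"` takes precedence over the `skip_noarchive` implication.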


@@ -0,0 +1,66 @@
"""Tests for OcrConfig archive_file_generation field behavior."""
from __future__ import annotations
from typing import TYPE_CHECKING
import pytest
from django.test import override_settings
from paperless.config import OcrConfig
if TYPE_CHECKING:
from unittest.mock import MagicMock
@pytest.fixture()
def null_app_config(mocker) -> MagicMock:
"""Mock ApplicationConfiguration with all fields None → falls back to Django settings."""
return mocker.MagicMock(
output_type=None,
pages=None,
language=None,
mode=None,
archive_file_generation=None,
image_dpi=None,
unpaper_clean=None,
deskew=None,
rotate_pages=None,
rotate_pages_threshold=None,
max_image_pixels=None,
color_conversion_strategy=None,
user_args=None,
)
@pytest.fixture()
def make_ocr_config(mocker, null_app_config):
mocker.patch(
"paperless.config.BaseConfig._get_config_instance",
return_value=null_app_config,
)
def _make(**django_settings_overrides):
with override_settings(**django_settings_overrides):
return OcrConfig()
return _make
class TestOcrConfigArchiveFileGeneration:
def test_auto_from_settings(self, make_ocr_config) -> None:
cfg = make_ocr_config(OCR_MODE="auto", ARCHIVE_FILE_GENERATION="auto")
assert cfg.archive_file_generation == "auto"
def test_always_from_settings(self, make_ocr_config) -> None:
cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
assert cfg.archive_file_generation == "always"
def test_never_from_settings(self, make_ocr_config) -> None:
cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="never")
assert cfg.archive_file_generation == "never"
def test_db_value_overrides_setting(self, make_ocr_config, null_app_config) -> None:
null_app_config.archive_file_generation = "never"
cfg = make_ocr_config(ARCHIVE_FILE_GENERATION="always")
assert cfg.archive_file_generation == "never"
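The precedence rule under test is simply "database value, when set, beats the Django setting". A minimal sketch of that resolution step (the helper name is an assumption; `OcrConfig` presumably applies the same pattern per field):

```python
def resolve_archive_file_generation(db_value, setting_value):
    """ApplicationConfiguration row value overrides the Django settings fallback."""
    return db_value if db_value is not None else setting_value
```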


@@ -0,0 +1,25 @@
"""Tests for paperless.parsers.utils helpers."""
from __future__ import annotations
from pathlib import Path
from paperless.parsers.utils import is_tagged_pdf
SAMPLES = Path(__file__).parent / "samples" / "tesseract"
class TestIsTaggedPdf:
def test_tagged_pdf_returns_true(self) -> None:
assert is_tagged_pdf(SAMPLES / "simple-digital.pdf") is True
def test_untagged_pdf_returns_false(self) -> None:
assert is_tagged_pdf(SAMPLES / "multi-page-images.pdf") is False
def test_nonexistent_path_returns_false(self) -> None:
assert is_tagged_pdf(Path("/nonexistent/file.pdf")) is False
def test_corrupt_pdf_returns_false(self, tmp_path: Path) -> None:
bad = tmp_path / "bad.pdf"
bad.write_bytes(b"not a pdf")
assert is_tagged_pdf(bad) is False
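A byte-scan approximation of the helper these tests exercise. The real implementation presumably inspects the document catalog with a PDF library; this sketch only greps the raw bytes for the `/MarkInfo` `/Marked` flag and is an assumption, but it reproduces the tested contract (nonexistent and corrupt inputs are simply "not tagged"):

```python
from __future__ import annotations

from pathlib import Path


def is_tagged_pdf(path: Path) -> bool:
    """Best-effort tagged-PDF check; any read failure means 'not tagged'."""
    try:
        data = Path(path).read_bytes()
    except OSError:
        return False
    if not data.startswith(b"%PDF"):
        # Corrupt or non-PDF input.
        return False
    # Tagged PDFs declare /MarkInfo << /Marked true >> in the catalog.
    return b"/MarkInfo" in data and b"/Marked" in data and b"true" in data
```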