diff --git a/docs/configuration.md b/docs/configuration.md
index 4ce2d9dc6..1250f7109 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -801,11 +801,14 @@ parsing documents.
 
 #### [`PAPERLESS_OCR_MODE=`](#PAPERLESS_OCR_MODE) {#PAPERLESS_OCR_MODE}
 
-: Tell paperless when and how to perform ocr on your documents. Three
+: Tell paperless when and how to perform ocr on your documents. Four
 modes are available:
 
-    - `skip`: Paperless skips all pages and will perform ocr only on
-      pages where no text is present. This is the safest option.
+    - `auto` (default): Paperless detects whether a document already
+      has embedded text via pdftotext. If sufficient text is found,
+      OCR is skipped for that document (`--skip-text`). If no text is
+      present, OCR runs normally. This is the safest option for mixed
+      document collections.
 
     - `redo`: Paperless will OCR all pages of your documents and
       attempt to replace any existing text layers with new text. This
@@ -823,24 +826,39 @@ modes are available:
       significantly larger and text won't appear as sharp when zoomed
       in.
 
-    The default is `skip`, which only performs OCR when necessary and
-    always creates archived documents.
+    - `off`: Paperless never invokes the OCR engine. For PDFs, text
+      is extracted via pdftotext only. For image documents, text will
+      be empty. Archive file generation still works via format
+      conversion (no Tesseract or Ghostscript required).
+
+    The default is `auto`.
 
     Read more about this in the [OCRmyPDF
     documentation](https://ocrmypdf.readthedocs.io/en/latest/advanced.html#when-ocr-is-skipped).
 
-#### [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE=`](#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) {#PAPERLESS_OCR_SKIP_ARCHIVE_FILE}
+#### [`PAPERLESS_ARCHIVE_FILE_GENERATION=`](#PAPERLESS_ARCHIVE_FILE_GENERATION) {#PAPERLESS_ARCHIVE_FILE_GENERATION}
 
-: Specify when you would like paperless to skip creating an archived
-version of your documents. This is useful if you don't want to have two
-almost-identical versions of your documents in the media folder.
+: Controls when paperless creates a PDF/A archive version of your
+documents. Archive files are stored alongside the original and are used
+for display in the web interface.
 
-    - `never`: Never skip creating an archived version.
-    - `with_text`: Skip creating an archived version for documents
-      that already have embedded text.
-    - `always`: Always skip creating an archived version.
+    - `auto` (default): Produce archives for scanned or image-based
+      documents. Skip archive generation for born-digital PDFs that
+      already contain embedded text. This is the recommended setting
+      for mixed document collections.
+    - `always`: Always produce a PDF/A archive when the parser
+      supports it, regardless of whether the document already has
+      text.
+    - `never`: Never produce an archive. Only the original file is
+      stored. Saves disk space but the web viewer will display the
+      original file directly.
 
-    The default is `never`.
+    !!! note
+
+        This setting only applies to parsers that can produce archives
+        (e.g. the Tesseract/OCR parser). Parsers that must convert
+        documents to PDF for display (e.g. DOCX, ODT via Tika) will
+        always produce a PDF regardless of this setting.
 
 #### [`PAPERLESS_OCR_CLEAN=`](#PAPERLESS_OCR_CLEAN) {#PAPERLESS_OCR_CLEAN}
 
diff --git a/docs/migration-v3.md b/docs/migration-v3.md
index 4c728a6a4..014b229ba 100644
--- a/docs/migration-v3.md
+++ b/docs/migration-v3.md
@@ -104,6 +104,58 @@ Multiple options are combined in a single value:
 PAPERLESS_DB_OPTIONS="sslmode=require;sslrootcert=/certs/ca.pem;pool.max_size=10"
 ```
 
+## OCR and Archive File Generation Settings
+
+The settings that control OCR behaviour and archive file generation have been redesigned. The old settings that coupled these two concerns together are **removed** — there are no migration shims.
+
+### Removed settings
+
+| Removed Setting                             | Replacement                                                           |
+| ------------------------------------------- | --------------------------------------------------------------------- |
+| `PAPERLESS_OCR_MODE=skip`                   | `PAPERLESS_OCR_MODE=auto` (new default)                               |
+| `PAPERLESS_OCR_MODE=skip_noarchive`         | `PAPERLESS_OCR_MODE=auto` + `PAPERLESS_ARCHIVE_FILE_GENERATION=never` |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=never`     | `PAPERLESS_ARCHIVE_FILE_GENERATION=always`                            |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text` | `PAPERLESS_ARCHIVE_FILE_GENERATION=auto` (new default)                |
+| `PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always`    | `PAPERLESS_ARCHIVE_FILE_GENERATION=never`                             |
+
+### What changed and why
+
+Previously, `OCR_MODE` conflated two independent concerns: whether to run OCR and whether to produce an archive. `skip` meant "skip OCR if text exists, but always produce an archive". `skip_noarchive` meant "skip OCR if text exists, and also skip the archive". This made it impossible to, for example, disable OCR entirely while still producing archives.
+
+The new settings are independent:
+
+- [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) controls OCR: `auto` (default), `force`, `redo`, `off`.
+- [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) controls archive production: `auto` (default), `always`, `never`.
+
+### Action Required
+
+Remove any `PAPERLESS_OCR_SKIP_ARCHIVE_FILE` variable from your environment. If you relied on `OCR_MODE=skip` or `OCR_MODE=skip_noarchive`, update accordingly:
+
+```bash
+# v2: skip OCR when text present, always archive
+PAPERLESS_OCR_MODE=skip
+# v3: equivalent (auto is the new default)
+# No change needed — auto is the default
+
+# v2: skip OCR when text present, skip archive too
+PAPERLESS_OCR_MODE=skip_noarchive
+# v3: equivalent
+PAPERLESS_OCR_MODE=auto
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: always skip archive
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=always
+# v3: equivalent
+PAPERLESS_ARCHIVE_FILE_GENERATION=never
+
+# v2: skip archive only for born-digital docs
+PAPERLESS_OCR_SKIP_ARCHIVE_FILE=with_text
+# v3: equivalent (auto is the new default)
+PAPERLESS_ARCHIVE_FILE_GENERATION=auto
+```
+
+Paperless will emit a startup warning if the old environment variables are still set.
+
 ## OpenID Connect Token Endpoint Authentication
 
 Some existing OpenID Connect setups may require an explicit token endpoint authentication method after upgrading to v3.
diff --git a/docs/setup.md b/docs/setup.md
index 3b84fd729..5580dde92 100644
--- a/docs/setup.md
+++ b/docs/setup.md
@@ -633,12 +633,11 @@ hardware, but a few settings can improve performance:
   consumption, so you might want to lower these settings (example: 2
   workers and 1 thread to always have some computing power left for
   other tasks).
-- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `skip` and consider
+- Keep [`PAPERLESS_OCR_MODE`](configuration.md#PAPERLESS_OCR_MODE) at its default value `auto` and consider
   OCRing your documents before feeding them into Paperless. Some
   scanners are able to do this!
-- Set [`PAPERLESS_OCR_SKIP_ARCHIVE_FILE`](configuration.md#PAPERLESS_OCR_SKIP_ARCHIVE_FILE) to `with_text` to skip archive
-  file generation for already OCRed documents, or `always` to skip it
-  for all documents.
+- Set [`PAPERLESS_ARCHIVE_FILE_GENERATION`](configuration.md#PAPERLESS_ARCHIVE_FILE_GENERATION) to `never` to skip archive
+  file generation entirely, saving disk space at the cost of in-browser PDF/A viewing.
 - If you want to perform OCR on the device, consider using
   `PAPERLESS_OCR_CLEAN=none`. This will speed up OCR times and use
   less memory at the expense of slightly worse OCR results.
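
The "Removed settings" table added to `docs/migration-v3.md` in this change can be read as a plain lookup. The sketch below is one possible way to apply it to a set of environment variables before upgrading. It is not part of Paperless: `migrate_ocr_settings` is a hypothetical helper, and two behaviours are assumptions rather than anything the diff states: a `PAPERLESS_ARCHIVE_FILE_GENERATION` value that is already set is left untouched, and `skip_noarchive` takes precedence over a conflicting `PAPERLESS_OCR_SKIP_ARCHIVE_FILE`.

```python
# Hypothetical helper, not shipped with Paperless. It rewrites v2 OCR and
# archive settings to the v3 names listed in the "Removed settings" table.

# Old PAPERLESS_OCR_SKIP_ARCHIVE_FILE value -> new PAPERLESS_ARCHIVE_FILE_GENERATION value.
SKIP_ARCHIVE_TO_GENERATION = {
    "never": "always",
    "with_text": "auto",
    "always": "never",
}


def migrate_ocr_settings(env: dict[str, str]) -> dict[str, str]:
    """Return a copy of env with the removed v2 settings replaced."""
    result = dict(env)

    # The old variable is removed in v3, so always drop it from the result.
    old_skip = result.pop("PAPERLESS_OCR_SKIP_ARCHIVE_FILE", None)

    mode = result.get("PAPERLESS_OCR_MODE")
    if mode == "skip":
        result["PAPERLESS_OCR_MODE"] = "auto"
    elif mode == "skip_noarchive":
        result["PAPERLESS_OCR_MODE"] = "auto"
        # Assumption: an explicitly set v3 value is not overwritten.
        result.setdefault("PAPERLESS_ARCHIVE_FILE_GENERATION", "never")

    if old_skip is not None:
        if old_skip not in SKIP_ARCHIVE_TO_GENERATION:
            raise ValueError(
                f"unknown PAPERLESS_OCR_SKIP_ARCHIVE_FILE value: {old_skip!r}"
            )
        # Assumption: skip_noarchive (handled above) wins over this value.
        result.setdefault(
            "PAPERLESS_ARCHIVE_FILE_GENERATION",
            SKIP_ARCHIVE_TO_GENERATION[old_skip],
        )

    return result


if __name__ == "__main__":
    print(migrate_ocr_settings({"PAPERLESS_OCR_MODE": "skip_noarchive"}))
```

Modes that exist in both versions, such as `redo` and `force`, pass through unchanged, and an unrecognised old value raises an error instead of being guessed at.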