TextDocumentParser.__init__ accepts logging_group: object = None, same
as RemoteDocumentParser. The old shim incorrectly dropped it; fix to
forward it as a positional arg and only drop progress_callback.
Add type annotations and from __future__ import annotations for
consistency with the remote parser signals shim.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
consumer.py calls parser_class(logging_group, progress_callback=...).
RemoteDocumentParser.__init__ accepts logging_group but not
progress_callback, so only the latter is dropped — matching the pattern
established by the TextDocumentParser signals shim.
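The shim pattern described by these two commits can be sketched roughly as follows; `RemoteDocumentParser` here is a minimal stand-in and `make_parser` is a hypothetical name for the shim, not the real code:

```python
class RemoteDocumentParser:
    # minimal stand-in: accepts logging_group, but no progress_callback
    def __init__(self, logging_group: object = None):
        self.logging_group = logging_group


def make_parser(parser_class, logging_group, progress_callback=None):
    # consumer.py calls parser_class(logging_group, progress_callback=...);
    # the shim forwards logging_group positionally and drops only
    # progress_callback, which the new-style parsers do not accept
    return parser_class(logging_group)
```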
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- remote.py: add `if TYPE_CHECKING: assert` guards before the Azure
client construction to narrow config.endpoint and config.api_key from
str|None to str. The narrowing is safe: engine_is_valid() guarantees
both are non-None when it returns True (api_key explicitly; endpoint
via `not (engine=="azureai" and endpoint is None)` for the only valid
engine). Asserts are wrapped in TYPE_CHECKING so they carry zero
runtime cost.
- signals.py: add full type annotations — return types, Any-typed
sender parameter, and explicit logging_group argument replacing *args.
Add `from __future__ import annotations` for consistent annotation style.
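A sketch of the narrowing pattern, using a simplified stand-in for `RemoteEngineConfig` (the real validity check is more involved):

```python
from dataclasses import dataclass
from typing import TYPE_CHECKING, Optional


@dataclass
class RemoteEngineConfig:
    # simplified stand-in for the real config class
    engine: Optional[str] = None
    endpoint: Optional[str] = None
    api_key: Optional[str] = None

    def engine_is_valid(self) -> bool:
        return (
            self.engine == "azureai"
            and self.api_key is not None
            and self.endpoint is not None
        )


def build_client_args(config: RemoteEngineConfig):
    if not config.engine_is_valid():
        raise ValueError("remote engine is not configured")
    if TYPE_CHECKING:
        # Only type checkers evaluate this block (TYPE_CHECKING is False
        # at runtime), so Optional[str] narrows to str at zero runtime cost.
        assert config.endpoint is not None
        assert config.api_key is not None
    return config.endpoint, config.api_key
```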
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- paperless_remote/signals.py: import from paperless.parsers.remote
(new location after git mv). supported_mime_types() is now a
classmethod that always returns the full set, so get_supported_mime_types()
in the signal layer explicitly checks RemoteEngineConfig validity and
returns {} when unconfigured — preserving the old behaviour where an
unconfigured remote parser does not register for any MIME types.
- documents/consumer.py: extend the _parser_cleanup() shim, parse()
dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
alongside TextDocumentParser. Both new-style parsers use __exit__
for cleanup and take (document_path, mime_type) without a file_name
argument.
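The signal-layer check described above can be sketched as follows; the class bodies are simplified stand-ins and only the names come from this commit:

```python
FULL_MIME_SET = frozenset({"application/pdf", "image/png"})  # placeholder set


class RemoteEngineConfig:
    def __init__(self, engine=None):
        self.engine = engine

    def engine_is_valid(self):
        return self.engine is not None


class RemoteDocumentParser:
    @classmethod
    def supported_mime_types(cls):
        # the classmethod now always reports the full set
        return FULL_MIME_SET


def get_supported_mime_types(config):
    # signal layer preserves the old behaviour: an unconfigured remote
    # parser does not register for any MIME types
    if not config.engine_is_valid():
        return set()
    return RemoteDocumentParser.supported_mime_types()
```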
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- make_azure_mock moved from conftest.py back into test_remote_parser.py;
it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
in one step; tests no longer repeat the mocker.patch call or carry an
unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
with a RuntimeError side effect; TestRemoteParserParseError now only
receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
(pdf, png, jpeg, ...) for readable test output
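The `pytest.param` style reads roughly like this (the MIME values and test body here are placeholders):

```python
import pytest

MIME_CASES = [
    pytest.param("application/pdf", id="pdf"),
    pytest.param("image/png", id="png"),
    pytest.param("image/jpeg", id="jpeg"),
]


@pytest.mark.parametrize("mime_type", MIME_CASES)
def test_mime_type_is_supported(mime_type):
    # placeholder assertion; the real checks live in test_remote_parser.py
    assert mime_type
```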
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
in conftest.py; tests call `make_azure_mock()` or
`make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
`@pytest.mark.usefixtures` wherever their value is not referenced
inside the test body; `TestRemoteParserParseError` marked at the class
level since all three tests need the same setting
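Applied at the class level, the marker looks like this (the fixture body and test are placeholders):

```python
import pytest


@pytest.fixture
def azure_settings():
    # placeholder: the real fixture configures the remote engine settings
    return {"engine": "azureai"}


# the fixture's value is never referenced in the test bodies, so it is
# applied via the marker instead of an unused parameter on every test
@pytest.mark.usefixtures("azure_settings")
class TestRemoteParserParseError:
    def test_parse_error_is_raised(self):
        assert True  # placeholder body
```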
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Rewrites the remote OCR parser to the new plugin system contract:
- `supported_mime_types()` is now a classmethod that always returns the
full set of 7 MIME types; the old instance-method hack (returning {}
when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
(making the parser invisible to the registry), and 20 when active —
higher than the tesseract default of 10 so the remote engine takes
priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
returns [] for all other MIME types
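The registry-facing contract might look roughly like this; the exact MIME set and how `score()` learns whether an engine is configured are assumptions, not the real signatures:

```python
from typing import Optional

# assumption: one plausible "full set of 7 MIME types"
SUPPORTED_MIME_TYPES = frozenset({
    "application/pdf",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/bmp",
    "image/gif",
    "image/webp",
})


class RemoteDocumentParser:
    can_produce_archive = True
    requires_pdf_rendition = False

    @classmethod
    def supported_mime_types(cls) -> frozenset:
        # always the full set, whether or not a remote engine is configured
        return SUPPORTED_MIME_TYPES

    @classmethod
    def score(cls, engine_configured: bool) -> Optional[int]:
        # None hides the parser from the registry; 20 outranks the
        # tesseract default of 10 when a remote engine is available
        return 20 if engine_configured else None
```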
New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
`get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
sample-file, and settings-helper fixtures
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Relocates three files to their new homes in the parser plugin system:
- src/paperless_remote/parsers.py
→ src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
→ src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
→ src/paperless/tests/samples/remote/simple-digital.pdf
Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: stream manifest parsing with ijson in document_importer
Replace bulk json.load of the full manifest (which materializes the
entire JSON array into memory) with incremental ijson streaming.
Eliminates self.manifest entirely — records are never all in memory
at once.
- Add ijson>=3.2 dependency
- New module-level iter_manifest_records() generator
- load_manifest_files() collects paths only; no parsing at load time
- check_manifest_validity() streams without accumulating records
- decrypt_secret_fields() streams each manifest to a .decrypted.json
temp file record-by-record; temp files cleaned up after file copy
- _import_files_from_manifest() collects only document records (small
fraction of manifest) for the tqdm progress bar
Measured on 200 docs + 200 CustomFieldInstances:
- Streaming validation: peak memory 3081 KiB -> 333 KiB (89% reduction)
- Stream-decrypt to file: peak memory 3081 KiB -> 549 KiB (82% reduction)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: slim dict in _import_files_from_manifest, discard fields
When collecting document records for the file-copy step, extract only
the 4 keys the loop actually uses (pk + 3 exported filename keys) and
discard the full fields dict (content, checksum, tags, etc.).
Peak memory for the document-record list: 939 KiB -> 375 KiB (60% reduction).
Wall time unchanged.
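Something like the following; only `pk` is confirmed by the message, the filename key names are illustrative:

```python
# assumption: plausible names for the three exported filename keys
FILENAME_KEYS = (
    "__exported_file_name__",
    "__exported_thumbnail_name__",
    "__exported_archive_name__",
)


def slim_document_record(record: dict) -> dict:
    fields = record["fields"]
    slim = {"pk": record["pk"]}
    for key in FILENAME_KEYS:
        if key in fields:
            slim[key] = fields[key]
    # content, checksum, tags, etc. are dropped here and garbage-collected
    return slim
```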
* Refactor: migrate exporter/importer from tqdm to PaperlessCommand.track()
Replace direct tqdm usage in document_exporter and document_importer with
the PaperlessCommand base class and its track() method, which is backed by
Rich and handles --no-progress-bar automatically. Also removes the unused
ProgressBarMixin from mixins.py.
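Under the hood this is Rich's `track()`; a hedged sketch of the call pattern (the `PaperlessCommand` wrapper itself is not shown, and the function name is illustrative):

```python
from rich.progress import track


def copy_documents(documents, no_progress_bar: bool = False):
    copied = []
    # rich's track() honours disable=, which is how --no-progress-bar
    # can be handled without tqdm
    for document in track(documents, description="Copying", disable=no_progress_bar):
        copied.append(document)
    return copied
```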
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: add explicit supports_progress_bar and supports_multiprocessing to all PaperlessCommand subclasses
Each management command now explicitly declares both class attributes
rather than relying on defaults, making intent unambiguous at a glance.
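For example (attribute names from the commit; the base-class defaults shown are assumptions):

```python
class PaperlessCommand:
    # assumed defaults on the base class
    supports_progress_bar = True
    supports_multiprocessing = False


class Command(PaperlessCommand):
    # declared explicitly even where they match the defaults, so the
    # command's behaviour is visible at a glance
    supports_progress_bar = True
    supports_multiprocessing = True
```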
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Perf: streaming manifest writer for document exporter (Phase 3)
Replaces the in-memory manifest dict accumulation with a
StreamingManifestWriter that writes records to manifest.json
incrementally, keeping only one batch resident in memory at a time.
Key changes:
- Add StreamingManifestWriter: writes to .tmp atomically, BLAKE2b
compare for --compare-json, discard() on exception
- Add _encrypt_record_inline(): per-record encryption replacing the
bulk encrypt_secret_fields() call; crypto setup moved before streaming
- Add _write_split_manifest(): extracted per-document manifest writing
- Refactor dump(): non-doc records streamed during transaction, documents
accumulated then written after filenames are assigned
- Upgrade check_and_write_json() from MD5 to BLAKE2b
- Remove encrypt_secret_fields() and unused itertools.chain import
- Add profiling marker to pyproject.toml
Measured improvement (200 docs + 200 CustomFieldInstances, same
dump() code path, only writer differs):
- Peak memory: ~50% reduction
- Memory delta: ~70% reduction
- Wall time and query count: unchanged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* Refactor: O(1) lookup table for CRYPT_FIELDS in per-record encryption
Add CRYPT_FIELDS_BY_MODEL to CryptMixin, derived from CRYPT_FIELDS at
class definition time. _encrypt_record_inline() now does a single dict
lookup instead of a linear scan per record, eliminating the loop and
break pattern.
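The derived table is a one-line dict comprehension; the shape of `CRYPT_FIELDS` below is an assumption for illustration:

```python
# assumption: CRYPT_FIELDS entries map a model name to its secret fields
CRYPT_FIELDS = [
    {"model_name": "paperless_mail.mailaccount", "fields": ["password", "refresh_token"]},
]

# built once at class-definition time; encryption then does a single
# O(1) lookup per record instead of a linear scan with loop-and-break
CRYPT_FIELDS_BY_MODEL = {
    entry["model_name"]: entry["fields"] for entry in CRYPT_FIELDS
}
```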
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Phase 1 -- Eliminate JSON round-trip in document exporter
Replace json.loads(serializers.serialize("json", qs)) with
serializers.serialize("python", qs) to skip the intermediate
JSON string allocation and parse step. Use DjangoJSONEncoder
in check_and_write_json() to handle native Python types
(datetime, Decimal, UUID) the Python serializer returns.
Phase 2 -- Batched QuerySet serialization in document exporter
Add serialize_queryset_batched() helper that uses QuerySet.iterator()
and itertools.islice to stream records in configurable chunks, bounding
peak memory during serialization to batch_size * avg_record_size rather
than loading the entire QuerySet at once.
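The batching itself is generic `islice` chunking; here is a framework-free sketch (the real helper feeds `QuerySet.iterator()` into Django's `"python"` serializer):

```python
from itertools import islice


def serialize_batched(records, batch_size=1000, serialize=list):
    # consume any iterator in fixed-size chunks so that at most one
    # batch of records is resident in memory at a time
    iterator = iter(records)
    while batch := list(islice(iterator, batch_size)):
        yield from serialize(batch)
```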
* Fix: improve test portability
* Make settings always consistent
* Make a few more tests deterministic wrt settings
* Dont pollute settings for this one
* Fix timezone issue with mail parser
* Update test_parser.py
* Uh, I guess OCR gives variants for this
---------
Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
* Uses a custom transport to resolve the slim chance of a DNS rebinding affecting the webhook
* Fix WebhookTransport hostname resolution and validation
* Fix test failures
* Lint
* Keep all internal logic inside WebhookTransport
* Fix test failure
* Update handlers.py
* Update handlers.py
---------
Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
* tests: general cleanup and fixes for running under docker
This now allows tests to be run under a locally built or production
docker image with something like:
`docker run --rm -v $PWD:/usr/src/paperless --entrypoint=bash paperlessngx/paperless-ngx:latest -c "uv run pytest"`
Specific fixes:
- fix unreachable code around `assertRaises` blocks
- fix `assertInt` typos
- fix `str(e)` vs `str(e.exception)` issues
- skip permission-based checks when root (in a docker container)
- catch `OSError` problems when instantiating `INotify` and
skip inotify-based tests when it's unavailable.
* Reverts most files to dev while keeping the exception assert fixes
---------
Co-authored-by: Trenton H <797416+stumpylog@users.noreply.github.com>
This alters the retry/backoff logic in the init-wait-for-db script to be more
optimistic about database availability. During regular deployment and
operations of paperless-ngx, it's common to restart the application server with
the database instance already running, so we should optimize for this case.
Instead of unconditionally delaying 5 seconds between each connection attempt,
start with a minimum delay of 1 second and increase the delay linearly with
each attempt, maxing out at 10 seconds. This makes the retry count-based
failure mode less practical, so instead we just use a timeout-based approach.*
*NOTE: the original implementation would have an effective timeout of 25s. This
alters the behavior to 60s.
Additionally, this removes an unnecessary 5s delay that was injected in the
postgres case. The script uses a more comprehensive connection check for
postgres than it does mariadb, so if anything this 5s delay after getting an
"ok" response from the DB was extra unnecessary in the postgres case.
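The delay schedule can be sketched as follows (function names are illustrative, not the script's actual ones):

```python
import time


def wait_for_db(try_connect, timeout: float = 60.0) -> bool:
    # linear backoff: 1s, 2s, 3s, ... capped at 10s, bounded by a
    # total timeout instead of a retry count
    deadline = time.monotonic() + timeout
    attempt = 0
    while time.monotonic() < deadline:
        if try_connect():
            return True
        attempt += 1
        time.sleep(min(attempt, 10))
    return False
```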
* chore(devcontainer): drop read-only host .gitconfig bind mount
The bind mount prevented adjusting git config inside the dev container, and VS Code Dev Containers already copies the host .gitconfig automatically, making the mount unnecessary. This restores ability to manage git settings within the container.
* chore(gitignore): ignore .pnpm-store folder for pnpm package management
Add .pnpm-store/ to .gitignore to prevent local pnpm package store from being tracked by git when using the devcontainer.
* docs(development): clarify VS Code devcontainer setup steps for Windows
Add instructions on how to overcome some issues caused by using Windows as the host system.
This helps prevent excessive processing times on very large documents
by limiting the text analyzed during date parsing, tag prediction,
and correspondent matching.
If the document exceeds 1.2M chars, crop to 1M chars.
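In other words (the constant and function names are illustrative):

```python
CROP_TRIGGER = 1_200_000  # only crop clearly oversized documents
CROP_LENGTH = 1_000_000   # amount of text actually analyzed


def limit_content_for_matching(content: str) -> str:
    if len(content) > CROP_TRIGGER:
        return content[:CROP_LENGTH]
    return content
```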
@@ -89,6 +89,18 @@ Additional tasks are available for common maintenance operations:
- **Migrate Database**: To apply database migrations.
- **Create Superuser**: To create an admin user for the application.
## Committing from the Host Machine
The DevContainer automatically installs Git pre-commit hooks during setup. However, these hooks are configured for use inside the container.
If you want to commit changes from your host machine (outside the DevContainer), you need to set up prek on your host. This installs it as a standalone tool.
```bash
uv tool install prek && prek install
```
After this, you can commit either from inside the DevContainer or from your host machine.
## Let's Get Started!
Follow the steps above to get your development environment up and running. Happy coding!
### ⚠️ Please remember: issues are for *bugs* only! ⚠️
That is, something you believe affects every single user of Paperless-ngx, not just you. If you're not sure, start with one of the other options below.
That is, something you believe affects every single user of Paperless-ngx (and the demo, for example), not just you. If you are not sure, start with one of the other options below.
Also, note that **Paperless-ngx does not perform OCR or archive file creation itself**, those are handled by other tools. Problems with OCR or archive versions of specific files should likely be raised 'upstream', see https://github.com/ocrmypdf/OCRmyPDF/issues or https://github.com/tesseract-ocr/tesseract/issues
- type: markdown
@@ -59,6 +59,12 @@ body:
    label: Browser logs
    description: Logs from the web browser related to your issue, if needed
    render: bash
- type: textarea
  id: logs_services
  attributes:
    label: Services logs
    description: Logs from other services (or containers) related to your issue, if needed. For example, the database or redis logs.
@@ -35,8 +35,8 @@ NOTE: PRs that do not address the following will not be merged, please do not sk
- [ ] I have read & agree with the [contributing guidelines](https://github.com/paperless-ngx/paperless-ngx/blob/main/CONTRIBUTING.md).
- [ ] If applicable, I have included testing coverage for new code in this PR, for [backend](https://docs.paperless-ngx.com/development/#testing) and / or [front-end](https://docs.paperless-ngx.com/development/#testing-and-code-style) changes.
- [ ] If applicable, I have tested my code for new features & regressions on both mobile & desktop devices, using the latest version of major browsers.
- [ ] If applicable, I have tested my code for breaking changes & regressions on both mobile & desktop devices, using the latest version of major browsers.
- [ ] If applicable, I have checked that all tests pass, see [documentation](https://docs.paperless-ngx.com/development/#back-end-development).
- [ ] I have run all `pre-commit` hooks, see [documentation](https://docs.paperless-ngx.com/development/#code-formatting-with-pre-commit-hooks).
- [ ] I have run all Git `pre-commit` hooks, see [documentation](https://docs.paperless-ngx.com/development/#code-formatting-with-pre-commit-hooks).
- [ ] I have made corresponding changes to the documentation as needed.
- [ ] I have checked my modifications for any breaking changes.
- [ ] In the description of the PR above I have disclosed the use of AI tools in the coding of this PR.
If you feel like contributing to the project, please do! Bug fixes and improvements are always welcome.
⚠️ Please note: Pull requests that implement a new feature or enhancement _should almost always target an existing feature request_ with evidence of community interest and discussion. This is in order to balance the work of implementing and maintaining new features / enhancements. Pull requests that are opened without meeting this requirement may not be merged.
If you want to implement something big:
- Please start a discussion about that in the issues! Maybe something similar is already in development and we can make it happen together.
- As above, please start with a discussion! Maybe something similar is already in development and we can make it happen together.
- When making additions to the project, consider if the majority of users will benefit from your change. If not, you're probably better off forking the project.
- Also consider if your change will get in the way of other users. A good change is a change that enhances the experience of some users who want that change and does not affect users who do not care about the change.
- Please see the [paperless-ngx merge process](#merging-prs) below.
## Python
Paperless supports python 3.10 - 3.12 at this time. We format Python code with [ruff](https://docs.astral.sh/ruff/formatter/).
Paperless-ngx currently supports Python 3.11, 3.12, 3.13, and 3.14. As a policy, we aim to support at least the three most recent Python versions, and drop support for versions as they reach end-of-life. Older versions may be supported if dependencies permit, but this is not guaranteed.
We format Python code with [ruff](https://docs.astral.sh/ruff/formatter/).
## Branches
@@ -133,7 +137,7 @@ community members. That said, in an effort to keep the repository organized and
- Issues, pull requests and discussions that are closed will be locked after 30 days of inactivity.
- Discussions with a marked answer will be automatically closed.
- Discussions in the 'General' or 'Support' categories will be closed after 180 days of inactivity.
- Feature requests that do not meet the following thresholds will be closed: 180 days of inactivity, < 5 "up-votes" after 180 days, < 20 "up-votes" after 1 year or < 80 "up-votes" at 2 years.
- Feature requests that do not meet the following thresholds will be closed: 180 days of inactivity with less than 80 "up-votes", < 5 "up-votes" after 180 days, < 20 "up-votes" after 1 year or < 40 "up-votes" at 2 years.
In all cases, threads can be re-opened by project maintainers and, of course, users can always create a new discussion for related concerns.
Finally, remember that all information remains searchable and 'closed' feature requests can still serve as inspiration for new features.
@@ -10,16 +10,16 @@ consuming documents at that time.
Options available to any installation of paperless:
- Use the [document exporter](#exporter). The document exporter exports all your documents,
  thumbnails, metadata, and database contents to a specific folder. You may import your
  documents and settings into a fresh instance of paperless again or store your
  documents in another DMS with this export.
  The document exporter is also able to update an already existing
  export. Therefore, incremental backups with `rsync` are entirely
  possible.
  The exporter does not include API tokens and they will need to be re-generated after importing.
!!! caution
@@ -29,28 +29,27 @@ Options available to any installation of paperless:
Options available to docker installations:
- Backup the docker volumes. These usually reside within
  `/var/lib/docker/volumes` on the host and you need to be root in
  order to access them.
  Paperless uses 4 volumes:
    - `paperless_media`: This is where your documents are stored.
    - `paperless_data`: This is where auxiliary data is stored. This
      folder also contains the SQLite database, if you use it.
    - `paperless_pgdata`: Exists only if you use PostgreSQL and
      contains the database.
    - `paperless_dbdata`: Exists only if you use MariaDB and contains
      the database.
Options available to bare-metal and non-docker installations:
- Backup the entire paperless folder. This ensures that if your
  paperless instance crashes at some point or your disk fails, you can
  simply copy the folder back into place and it works.
  When using PostgreSQL or MariaDB, you'll also have to backup the
  database.
### Restoring {#migrating-restoring}
@@ -62,6 +61,10 @@ copies you created in the steps above.
## Updating Paperless {#updating}
!!! warning
Please review the [migration instructions](migration-v3.md) before upgrading Paperless-ngx to v3.0, it includes some breaking changes that require manual intervention before upgrading.
### Docker Route {#docker-updating}
If a new release of paperless-ngx is available, upgrading depends on how
@@ -471,7 +474,7 @@ Failing to invalidate the cache after such modifications can lead to stale data
Use the following management command to clear the cache:
```
invalidate_cachalot
python3 manage.py invalidate_cachalot
```
!!! info
@@ -505,19 +508,19 @@ collection for issues.
The issues detected by the sanity checker are as follows:
- Missing original files.
- Missing archive files.
- Inaccessible original files due to improper permissions.
- Inaccessible archive files due to improper permissions.
- Corrupted original documents by comparing their checksum against
  what is stored in the database.
- Corrupted archive documents by comparing their checksum against what
  is stored in the database.
- Missing thumbnails.
- Inaccessible thumbnails due to improper permissions.
- Documents without any content (warning).
- Orphaned files in the media directory (warning). These are files
  that are not referenced by any document in paperless.
```
document_sanity_checker
```
@@ -580,39 +583,9 @@ document.
documents, such as encrypted PDF documents. The archiver will skip over
these documents each time it sees them.
### Managing encryption {#encryption}
!!! warning
Encryption was removed in [paperless-ng 0.9](changelog.md#paperless-ng-090)
because it did not really provide any additional security, the passphrase
was stored in a configuration file on the same system as the documents.
Furthermore, the entire text content of the documents is stored plain in
the database, even if your documents are encrypted. Filenames are not
encrypted as well. Finally, the web server provides transparent access to
your encrypted documents.
Consider running paperless on an encrypted filesystem instead, which
will then at least provide security against physical hardware theft.
#### Enabling encryption
Enabling encryption is no longer supported.
#### Disabling encryption
Basic usage to disable encryption of your document store:
(Note: If `PAPERLESS_PASSPHRASE` isn't set already, you need to specify
it here)
```
decrypt_documents [--passphrase SECR3TP4SSPHRA$E]
```
### Detecting duplicates {#fuzzy_duplicate}
Paperless already catches and prevents upload of exactly matching documents,
Paperless-ngx already catches and warns of exactly matching documents,
however a new scan of an existing document may not produce an exact bit for bit
duplicate. But the content should be exact or close, allowing detection.
- **Any:** Looks for any occurrence of any word provided in match in
  the PDF. If you define the match as `Bank1 Bank2`, it will match
  documents containing either of these terms.
- **All:** Requires that every word provided appears in the PDF,
  albeit not in the order provided.
- **Exact:** Matches only if the match appears exactly as provided
  (i.e. preserve ordering) in the PDF.
- **Regular expression:** Parses the match as a regular expression and
  tries to find a match within the document.
- **Fuzzy match:** Uses a partial matching based on locating the tag text
  inside the document, using a [partial ratio](https://rapidfuzz.github.io/RapidFuzz/Usage/fuzz.html#partial-ratio)
- **Auto:** Tries to automatically match new documents. This does not
  require you to set a match. See the [notes below](#automatic-matching).
When using the _any_ or _all_ matching algorithms, you can search for
terms that consist of multiple words by enclosing them in double quotes.
@@ -69,33 +69,33 @@ Paperless tries to hide much of the involved complexity with this
approach. However, there are a couple caveats you need to keep in mind
when using this feature:
- Changes to your documents are not immediately reflected by the
  matching algorithm. The neural network needs to be _trained_ on your
  documents after changes. Paperless periodically (default: once each
  hour) checks for changes and does this automatically for you.
- The Auto matching algorithm only takes documents into account which
  are NOT placed in your inbox (i.e. have no inbox tags assigned to
  them). This ensures that the neural network only learns from
  documents which you have correctly tagged before.
- The matching algorithm can only work if there is a correlation
  between the tag, correspondent, document type, or storage path and
  the document itself. Your bank statements usually contain your bank
  account number and the name of the bank, so this works reasonably
  well. However, tags such as "TODO" cannot be automatically
  assigned.
- The matching algorithm needs a reasonable number of documents to
  identify when to assign tags, correspondents, storage paths, and
  types. If one out of a thousand documents has the correspondent
  "Very obscure web shop I bought something five years ago", it will
  probably not assign this correspondent automatically if you buy
  something from them again. The more documents, the better.
- Paperless also needs a reasonable amount of negative examples to
  decide when not to assign a certain tag, correspondent, document
  type, or storage path. This will usually be the case as you start
  filling up paperless with documents. Example: If all your documents
  are either from "Webshop" or "Bank", paperless will assign one
  of these correspondents to ANY new document, if both are set to
automatic matching.
automatic matching.
## Hooking into the consumption process {#consume-hooks}
Troubleshooting:

- Monitor the Docker Compose log
  `cd ~/paperless-ngx; docker compose logs -f`
- Check your script's permissions, e.g. in case of permission error
  `sudo chmod 755 post-consumption-example.sh`
- Pipe your script's output to a log file e.g.
  `echo "${DOCUMENT_ID}" | tee --append /usr/src/paperless/scripts/post-consumption-example.log`
## File name handling {#file-name-handling}
your files differently, you can do that by adjusting the
[`PAPERLESS_FILENAME_FORMAT`](configuration.md#PAPERLESS_FILENAME_FORMAT) configuration option
or using [storage paths (see below)](#storage-paths). Paperless adds the
correct file extension e.g. `.pdf`, `.jpg` automatically.

When a document has file versions, each version uses the same naming rules and
storage path resolution as any other document file, with an added version suffix
such as `_v1`, `_v2`, etc.
This variable allows you to configure the filename (folders are allowed)
using placeholders. For example, configuring this to
will create a directory structure as follows:

Paperless provides the following variables for use within filenames:

- `{{ asn }}`: The archive serial number of the document, or "none".
- `{{ correspondent }}`: The name of the correspondent, or "none".
- `{{ document_type }}`: The name of the document type, or "none".
- `{{ tag_list }}`: A comma separated list of all tags assigned to the
  document.
- `{{ title }}`: The title of the document.
- `{{ created }}`: The full date (ISO 8601 format, e.g. `2024-03-14`) the document was created.
- `{{ created_year }}`: Year created only, formatted as the year with
  century.
- `{{ created_year_short }}`: Year created only, formatted as the year
  without century, zero padded.
- `{{ created_month }}`: Month created only (number 01-12).
- `{{ created_month_name }}`: Month created name, as per locale
- `{{ created_month_name_short }}`: Month created abbreviated name, as per
  locale
- `{{ created_day }}`: Day created only (number 01-31).
- `{{ added }}`: The full date (ISO format) the document was added to
  paperless.
- `{{ added_year }}`: Year added only.
- `{{ added_year_short }}`: Year added only, formatted as the year without
  century, zero padded.
- `{{ added_month }}`: Month added only (number 01-12).
- `{{ added_month_name }}`: Month added name, as per locale
- `{{ added_month_name_short }}`: Month added abbreviated name, as per
  locale
- `{{ added_day }}`: Day added only (number 01-31).
- `{{ owner_username }}`: Username of document owner, if any, or "none"
- `{{ original_name }}`: Document original filename, minus the extension, if any, or "none"
- `{{ doc_pk }}`: The paperless identifier (primary key) for the document.
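As an illustration, a format string combining several of these variables (the exact layout is up to you; this sketch sorts by year, then correspondent) could look like:

```
PAPERLESS_FILENAME_FORMAT={{ created_year }}/{{ correspondent }}/{{ title }}
```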
!!! warning
If paperless detects that two documents share the same filename,
paperless will automatically append `_01`, `_02`, etc to the filename.
This happens if all the placeholders in a filename evaluate to the same
value.

For versioned files, this counter is appended after the version suffix
(for example `statement_v2_01.pdf`).

If there are any errors in the placeholders included in `PAPERLESS_FILENAME_FORMAT`,
paperless will fall back to using the default naming scheme instead.
before empty placeholders are removed as well, empty directories are omitted.

When a single storage layout is not sufficient for your use case, storage paths allow for more complex
structure to set precisely where each document is stored in the file system.

- Each storage path is a [`PAPERLESS_FILENAME_FORMAT`](configuration.md#PAPERLESS_FILENAME_FORMAT) and
  follows the rules described above
- Each document is assigned a storage path using the matching algorithms described above, but can be
  overwritten at any time

For example, you could define the following two storage paths:
This allows for complex logic to be included in the format, including [logical structures](https://jinja.palletsprojects.com/en/3.1.x/templates/#list-of-control-structures)
and [filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#id11) to manipulate the [variables](#filename-format-variables)
provided. The template is provided as a string, potentially multiline, and rendered into a single line.

In addition, a limited `document` object is available for advanced templates.
This object includes common metadata fields such as `id`, `pk`, `title`, `content`, `page_count`, `created`, `added`, `modified`, `mime_type`,
`checksum`, `archive_checksum`, `archive_serial_number`, `filename`, `archive_filename`, and `original_filename`.
Related values are available as nested objects with limited fields, for example `document.correspondent.name`, etc.
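A template using this object together with a logical structure might look like the following sketch (field names as listed above; the fallback text is arbitrary):

```jinja
{% if document.correspondent %}{{ document.correspondent.name }}{% else %}unknown{% endif %}/{{ document.title }}
```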
#### Custom Jinja2 Filters
The `get_cf_value` filter retrieves a value from custom field data with an optional default.

###### Parameters

- `custom_fields`: This _must_ be the provided custom field data
- `name` (str): Name of the custom field to retrieve
- `default` (str, optional): Default value to return if field is not found or has no value

###### Returns

- `str | None`: The field value, default value, or `None` if neither exists

###### Examples
The `datetime` filter formats a datetime string or datetime object using Python's `strftime` format codes.

###### Parameters

- `value` (str | datetime): Date/time value to format (strings will be parsed automatically)
- `format` (str): Python strftime format string

###### Returns

- `str`: Formatted datetime string

###### Examples
See the [strftime format code documentation](https://docs.python.org/3.13/library/datetime.html#strftime-and-strptime-format-codes)
for the possible codes and their meanings.

##### Date Localization {#date-localization}

The `localize_date` filter formats a date or datetime object into a localized string using Babel internationalization.
This takes into account the provided locale for translation. Since this must be used on a date or datetime object,
you must access the field directly, i.e. `document.created`.
An ISO string can also be provided to control the output format.

###### Syntax

###### Parameters

- `value` (date | datetime | str): Date, datetime object or ISO string to format (datetime should be timezone-aware)
- `format` (str): Format type - either a Babel preset ('short', 'medium', 'long', 'full') or custom pattern
See the [supported format codes](https://unicode.org/reports/tr35/tr35-dates.htm

### Format Presets

- **short**: Abbreviated format (e.g., "1/15/24")
- **medium**: Medium-length format (e.g., "Jan 15, 2024")
- **long**: Long format with full month name (e.g., "January 15, 2024")
- **full**: Full format including day of week (e.g., "Monday, January 15, 2024")

#### Additional Variables

- `{{ tag_name_list }}`: A list of tag names applied to the document, ordered by the tag name. Note this is a list, not a single string
- `{{ custom_fields }}`: A mapping of custom field names to their type and value. A user can access the mapping by field name or check if a field is applied by checking its existence in the variable.
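For example, this variable can be combined with the `get_cf_value` filter described above ("Invoice Number" is a hypothetical field name):

```jinja
{{ custom_fields|get_cf_value("Invoice Number", "none") }}
```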
!!! tip
installation, you can use volumes to accomplish this:

- Version-specific file data (file, mime type, checksums, archive info, extracted text content) belongs to the selected/latest version.

Version-aware endpoints:

- `GET /api/documents/{id}/`: returns root document data; `content` resolves to latest version content by default. Use `?version={version_id}` to resolve content for a specific version.
- `PATCH /api/documents/{id}/`: content updates target the selected version (`?version={version_id}`) or latest version by default; non-content metadata updates target the root document.
- `"set_permissions": PERMISSIONS_OBJ` (see format [above](#permissions)) and / or
- `"owner": OWNER_ID or null`
- `"merge": true or false` (defaults to false)
- The `merge` flag determines if the supplied permissions will overwrite all existing permissions (including
  removing them) or be merged with existing permissions.
- `modify_custom_fields`
    - Requires `parameters`:
        - `"add_custom_fields": { CUSTOM_FIELD_ID: VALUE }`: JSON object consisting of custom field id:value pairs to add to the document, can also be a list of custom field IDs
          to add with empty values.
        - `"remove_custom_fields": [CUSTOM_FIELD_ID]`: custom field ids to remove from the document.

#### Document-editing operations

Beginning with version 10, the API supports individual endpoints for document-editing operations (`merge`, `rotate`, `edit_pdf`, etc.), so their documentation can be found in the API spec / viewer. Legacy document-editing methods via `/api/documents/bulk_edit/` are still supported for compatibility, but are deprecated and clients should migrate to the individual endpoints before they are removed in a future version.
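As a sketch, a legacy bulk-edit request body for `modify_custom_fields` could be assembled as follows. The document and custom field IDs here are made up, and the `documents`/`method`/`parameters` payload shape is an assumption based on the method descriptions above:

```python
import json

# Hypothetical IDs; the payload keys mirror the parameter names shown above.
payload = {
    "documents": [123],
    "method": "modify_custom_fields",
    "parameters": {
        # Add field 5 with a value, remove field 7 entirely.
        "add_custom_fields": {5: "2024-03-14"},
        "remove_custom_fields": [7],
    },
}

# Serialize for the request body.
body = json.dumps(payload)
```

The same structure applies to the other bulk-edit methods: only `method` and `parameters` change.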
### Objects
operations, using the endpoint: `/api/bulk_edit_objects/`, which requires a json payload.

## API Versioning

The REST API is versioned.

- Versioning ensures that changes to the API don't break older
  clients.
- Clients specify the specific version of the API they wish to use
  with every request and Paperless will handle the request using the
  specified API version.
- Even if the underlying data model changes, supported older API
  versions continue to serve compatible data.
- If no version is specified, Paperless serves the configured default
  API version (currently `10`).
- Supported API versions are currently `9` and `10`.

API versions are specified by submitting an additional HTTP `Accept`
header with every request:

```
Accept: application/json; version=10
```

If an invalid version is specified, Paperless responds with
`406 Not Acceptable` and an error message in the body.

If a client wishes to verify whether it is compatible with any given
server, the following procedure should be performed:

1. Perform an _authenticated_ request against any API endpoint. The
   server will add two custom headers to the response:

    ```
    X-Api-Version: 10
    X-Version: <server-version>
    ```

2. Determine whether the client is compatible with this server based on
Initial API version.

#### Version 2

- Added field `Tag.color`. This read/write string field contains a hex
  color such as `#a6cee3`.
- Added read-only field `Tag.text_color`. This field contains the text
  color to use for a specific tag, which is either black or white
  depending on the brightness of `Tag.color`.
- Removed field `Tag.colour`.

#### Version 3

- Permissions endpoints have been added.
- The format of the `/api/ui_settings/` has changed.

#### Version 4

- Consumption templates were refactored to workflows and API endpoints
  changed as such.

#### Version 5

- Added bulk deletion methods for documents and objects.

#### Version 6

- Moved acknowledge tasks endpoint to be under `/api/tasks/acknowledge/`.

#### Version 7

- The format of select type custom fields has changed to return the options
  as an array of objects with `id` and `label` fields as opposed to a simple
  list of strings. When creating or updating a custom field value of a
  document for a select type custom field, the value should be the `id` of
  the option whereas previously it was the index of the option.

#### Version 8

- The user field of document notes now returns a simplified user object
  rather than just the user ID.

#### Version 9

- The document `created` field is now a date, not a datetime. The
  `created_date` field is considered deprecated and will be removed in a
  future version.

#### Version 10

- The `show_on_dashboard` and `show_in_sidebar` fields of saved views have been
  removed. Relevant settings are now stored in the UISettings model. Compatibility is maintained
  for versions < 10 until support for API v9 is dropped.
- Document-editing operations such as `merge`, `rotate`, and `edit_pdf` have been
  moved from the bulk edit endpoint to their own individual endpoints. Using these methods via
  the bulk edit endpoint is still supported for compatibility with versions < 10 until support
  for API v9 is dropped.
Available options are `postgresql` and `mariadb`.

!!! danger

    **Do not modify the database outside the application while it is running.**
    This includes actions such as restoring a backup, upgrading the database, or performing manual inserts. All external modifications must be done **only when the application is stopped**.
    After making any such changes, you **must invalidate the DB read cache** using the `invalidate_cachalot` management command.

!!! warning

    A high TTL increases memory usage over time. Memory may be used until end of TTL, even if the cache is invalidated with the `invalidate_cachalot` command.
    In case of an out-of-memory (OOM) situation, Redis may stop accepting new data — including cache entries, scheduled tasks, and documents to consume.
    If your system has limited RAM, consider configuring a dedicated Redis instance for the read cache, with a memory limit and the eviction policy set to `allkeys-lru`.
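A dedicated read-cache instance along those lines could be configured with the standard Redis directives (the memory limit here is arbitrary, size it for your system):

```
maxmemory 256mb
maxmemory-policy allkeys-lru
```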
system. See the corresponding

: Sync groups from the third party authentication system (e.g. OIDC) to Paperless-ngx. When enabled, users will be added or removed from groups based on their group membership in the third party authentication system. Groups must already exist in Paperless-ngx and have the same name as in the third party authentication system. Groups are updated upon logging in via the third party authentication system, see the corresponding [django-allauth documentation](https://docs.allauth.org/en/dev/socialaccount/signals.html).

: In order to pass groups from the authentication system you will need to update your [PAPERLESS_SOCIALACCOUNT_PROVIDERS](#PAPERLESS_SOCIALACCOUNT_PROVIDERS) setting by adding a top-level "SCOPES" setting which includes "groups", or the custom groups claim configured in [`PAPERLESS_SOCIAL_ACCOUNT_SYNC_GROUPS_CLAIM`](#PAPERLESS_SOCIAL_ACCOUNT_SYNC_GROUPS_CLAIM), e.g.:
: Allows you to define a custom groups claim. See [PAPERLESS_SOCIAL_ACCOUNT_SYNC_GROUPS](#PAPERLESS_SOCIAL_ACCOUNT_SYNC_GROUPS) which is required for this setting to take effect.
: Configures how the consumer detects new files in the consumption directory.
When set to `0` (default), paperless uses native filesystem notifications for efficient, immediate detection of new files.
When set to a positive number, paperless polls the consumption directory at that interval in seconds. Use polling for network filesystems (NFS, SMB/CIFS) where native notifications may not work reliably.
: The model to use for the embedding backend for RAG. This can be set to any of the embedding models supported by the current embedding backend. If not supplied, defaults to "text-embedding-3-small" for OpenAI and "sentence-transformers/all-MiniLM-L6-v2" for Huggingface.
- `parser` is a reference to a class that extends `DocumentParser`.
- `weight` is used whenever two or more parsers are able to parse a
  file: The parser with the higher weight wins. This can be used to
  override the parsers provided by Paperless-ngx.
- `mime_types` is a dictionary. The keys are the mime types your
  parser supports and the value is the default file extension that
  Paperless-ngx should use when storing files and serving them for
  download. We could guess that from the file extensions, but some
  mime types have many extensions associated with them and the Python
  methods responsible for guessing the extension do not always return
  the same value.
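Putting the three keys together, a declaration returning this structure might look like the following sketch (`MyCustomParser`, the mime type, and the function name are hypothetical):

```python
class MyCustomParser:  # placeholder; a real parser extends DocumentParser
    pass


def myparser_consumer_declaration(sender, **kwargs):
    # Returns the registration structure described above: the parser class,
    # a weight to resolve conflicts with other parsers, and the supported
    # mime types mapped to their default file extensions.
    return {
        "parser": MyCustomParser,
        "weight": 10,
        "mime_types": {
            "application/x-my-format": ".myf",
        },
    }
```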
## Using Visual Studio Code devcontainer

To get started:

2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
3. In case your host operating system is Windows:
    - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
    - Git might have detected modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the container's terminal to fix this issue.
4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
   will initialize the database tables and create a superuser. Then you can compile the front end
   for production or run the frontend in debug mode.
5. The project is ready for debugging; start either the fullstack debug or individual debug
   processes. To spin up the project without debugging, run the task **Project Start: Run all Services**.
## Developing Date Parser Plugins
Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
### Creating a Date Parser Plugin
To create a custom date parser plugin, you need to:
1. Create a class that inherits from `DateParserPluginBase`
2. Implement the required abstract method
3. Register your plugin via an entry point
#### 1. Implementing the Parser Class
Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:
```python
from collections.abc import Iterator
import datetime

from documents.plugins.date_parsing import DateParserPluginBase


class MyDateParserPlugin(DateParserPluginBase):
    def parse(self, filename: str, content: str) -> Iterator[datetime.datetime]:
        """
        Parse dates from the document's filename and content.

        Args:
            filename: The original filename of the document
            content: The extracted text content of the document

        Yields:
            datetime.datetime: Valid datetime objects found in the document
        """
        # Your parsing logic here
        # Use self.config to access configuration settings

        # Example: parse dates from filename first
        if self.config.filename_date_order:
            # Your filename parsing logic
            yield some_datetime

        # Then parse dates from content
        # Your content parsing logic
        yield another_datetime
```
#### 2. Configuration and Helper Methods
Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:
- `languages: list[str]` - List of language codes for date parsing
- `timezone_str: str` - Timezone string for date localization
- `ignore_dates: set[datetime.date]` - Dates that should be filtered out
- `reference_time: datetime.datetime` - Current time for filtering future dates
- `filename_date_order: str | None` - Date order preference for filenames (e.g., "DMY", "MDY")
- `content_date_order: str` - Date order preference for content
The base class provides two helper methods you can use:
```python
def _parse_string(
self,
date_string: str,
date_order: str,
) -> datetime.datetime | None:
"""
Parse a single date string using dateparser with configured settings.
"""
def _filter_date(
self,
date: datetime.datetime | None,
) -> datetime.datetime | None:
"""
Validate a parsed datetime against configured rules.
Filters out dates before 1900, future dates, and ignored dates.
"""
```
#### 3. Resource Management (Optional)
If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods. Paperless-ngx will always use plugins as context managers, ensuring resources can be released even in the event of errors.
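For example, a plugin might open a resource in `__enter__` and release it in `__exit__`. The sketch below uses a plain class with a placeholder resource to illustrate the hooks; a real plugin would subclass `DateParserPluginBase` and manage a real client or connection:

```python
class ResourcefulParser:
    """Sketch of the context-manager hooks a plugin can override.

    The dict "resource" stands in for an HTTP session, database
    connection, or similar.
    """

    def __enter__(self):
        # Acquire resources before parsing begins.
        self.resource = {"open": True}
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Always called, even if parsing raised, so cleanup is guaranteed.
        self.resource["open"] = False
        return False  # do not suppress exceptions
```

Because Paperless-ngx enters the plugin with a `with` block, `__exit__` runs regardless of whether parsing succeeds or raises.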
#### 4. Registering Your Plugin
Register your plugin using a setuptools entry point in your package's `pyproject.toml`:
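For example (the package and module paths here are placeholders for your own project):

```toml
[project.entry-points."paperless_ngx.date_parsers"]
my_parser = "my_package.date_parsers:MyDateParserPlugin"
```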
The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.
### Plugin Discovery
Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:
1. Queries the `paperless_ngx.date_parsers` entry point group
2. Validates that each plugin is a subclass of `DateParserPluginBase`
3. Sorts valid plugins alphabetically by entry point name
4. Uses the first valid plugin, or falls back to the default `RegexDateParserPlugin` if none are found
If multiple plugins are installed, a warning is logged indicating which plugin was selected.
### Example: Simple Date Parser
Here's a minimal example that only looks for ISO 8601 dates:
```python
import datetime
import re
from collections.abc import Iterator
from documents.plugins.date_parsing.base import DateParserPluginBase
class ISODateParserPlugin(DateParserPluginBase):
    """
    Parser that only matches ISO 8601 formatted dates (YYYY-MM-DD).
    """

    ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

    def parse(self, filename: str, content: str) -> Iterator[datetime.datetime]:
        for match in self.ISO_DATE.finditer(f"{filename} {content}"):
            # Delegate parsing and validation to the base class helpers
            date = self._filter_date(self._parse_string(match.group(0), "YMD"))
            if date is not None:
                yield date
```
## Features

- **Organize and index** your scanned documents with tags, correspondents, types, and more.
- _Your_ data is stored locally on _your_ server and is never transmitted or shared in any way, unless you explicitly choose to do so.
- Performs **OCR** on your documents, adding searchable and selectable text, even to documents scanned with only images.
- Utilizes the open-source Tesseract engine to recognize more than 100 languages.
- Documents are saved as PDF/A format which is designed for long term storage, alongside the unaltered originals.
- _New!_ Supports remote OCR with Azure AI (opt-in).
- Uses machine-learning to automatically add tags, correspondents and document types to your documents.
- Supports PDF documents, images, plain text files, Office documents (Word, Excel, PowerPoint, and LibreOffice equivalents)[^1] and more.
- Paperless stores your documents plain on disk. Filenames and folders are managed by paperless and their format can be configured freely with different configurations assigned to different documents.
- **New**: Paperless-ngx can now leverage AI (Large Language Models or LLMs) for document suggestions. This is an optional feature that can be enabled (and is disabled by default).
- **Beautiful, modern web application** that features:
    - Customizable dashboard with statistics.
    - Filtering by tags, correspondents, types, and more.
    - Bulk editing of tags, correspondents, types and more.
    - Drag-and-drop uploading of documents throughout the app.
    - Customizable views can be saved and displayed on the dashboard and / or sidebar.
    - Support for custom fields of various data types.
    - Shareable public links with optional expiration.
- **Full text search** helps you find what you need:
    - Auto completion suggests relevant words from your documents.
    - Results are sorted by relevance to your search query.
    - Highlighting shows you which parts of the document matched the query.
    - Searching for similar documents ("More like this")
- **Email processing**[^1]: import documents from your email accounts:
    - Configure multiple accounts and rules for each account.
    - After processing, paperless can perform actions on the messages such as marking as read, deleting and more.
- A built-in robust **multi-user permissions** system that supports 'global' permissions as well as per document or object.
- A powerful workflow system that gives you even more control.
- **Optimized** for multi core systems: Paperless-ngx consumes multiple documents in parallel.
- The integrated sanity checker makes sure that your document archive is in good health.

[^1]: Office document and email consumption support is optional and provided by Apache Tika (see [configuration](https://docs.paperless-ngx.com/configuration/#tika))
| `CONSUMER_POLLING` | [`CONSUMER_POLLING_INTERVAL`](configuration.md#PAPERLESS_CONSUMER_POLLING_INTERVAL) | Renamed for clarity |
| `CONSUMER_INOTIFY_DELAY` | [`CONSUMER_STABILITY_DELAY`](configuration.md#PAPERLESS_CONSUMER_STABILITY_DELAY) | Unified for all modes |
| `CONSUMER_POLLING_DELAY` | _Removed_ | Use `CONSUMER_STABILITY_DELAY` |
| `CONSUMER_POLLING_RETRY_COUNT` | _Removed_ | Automatic with stability tracking |
| `CONSUMER_IGNORE_PATTERNS` | [`CONSUMER_IGNORE_PATTERNS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_PATTERNS) | **Now regex, not fnmatch**; user patterns are added to (not replacing) default ones |
| _New_ | [`CONSUMER_IGNORE_DIRS`](configuration.md#PAPERLESS_CONSUMER_IGNORE_DIRS) | Additional directories to ignore; user entries are added to (not replacing) defaults |
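A consumer section of an environment file migrated to the renamed settings might look like this (the values are illustrative placeholders, not recommendations):

```
PAPERLESS_CONSUMER_POLLING_INTERVAL=30
PAPERLESS_CONSUMER_STABILITY_DELAY=5
```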
## Encryption Support
Document and thumbnail encryption is no longer supported. This was previously deprecated in [paperless-ng 0.9.3](https://github.com/paperless-ngx/paperless-ngx/blob/dev/docs/changelog.md#paperless-ng-093)
Users must decrypt their documents using the `decrypt_documents` command before upgrading.
## Barcode Scanner Changes
Support for [pyzbar](https://github.com/NaturalHistoryMuseum/pyzbar) has been removed. The underlying libzbar library has
seen no updates in 16 years and is largely unmaintained, and the pyzbar Python wrapper last saw a release in March 2022. In
practice, pyzbar struggled with barcode detection reliability, particularly on skewed, low-contrast, or partially
obscured barcodes. [zxing-cpp](https://github.com/zxing-cpp/zxing-cpp) is actively maintained, significantly more
reliable at finding barcodes, and now ships pre-built wheels for both x86_64 and arm64, removing the need to build the library.
The `CONSUMER_BARCODE_SCANNER` setting has been removed. zxing-cpp is now the only backend.
Paperless-ngx is an application that manages your personal documents. With the (optional) help of a document scanner (see [the scanners wiki](https://github.com/paperless-ngx/paperless-ngx/wiki/Scanner-&-Software-Recommendations)), Paperless-ngx transforms your unwieldy physical documents into a searchable online archive so you can keep, well, _less paper._

Paperless essentially consists of two different parts for managing your documents:

- The _consumer_ watches a specified folder and adds all documents in that folder to paperless.
- The _web server_ (web UI) provides a UI that you use to manage and search documents.

Each document has data fields that you can assign to them:

- A _Document_ is a piece of paper that sometimes contains valuable information.
- The _correspondent_ of a document is the person, institution or company that a document either originates from, or is sent to.
- A _tag_ is a label that you can assign to documents. Think of labels as more powerful folders: Multiple documents can be grouped together with a single tag, however, a single document can also have multiple tags. This is not possible with folders. The reason folders are not implemented in paperless is simply that tags are much more versatile than folders.
- A _document type_ is used to demarcate the type of a document such as letter, bank statement, invoice, contract, etc. It is used to identify what a document is about.
- The document _storage path_ is the location where the document files are stored. See [Storage Paths](advanced_usage.md#storage-paths) for more information.
- The _date added_ of a document is the date the document was scanned into paperless. You cannot and should not change this date.
- The _date created_ of a document is the date the document was initially issued. This can be the date you bought a product, the date you signed a contract, or the date a letter was sent to you.
- The _archive serial number_ (short: ASN) of a document is the identifier of the document in your physical document binders. See
- The _content_ of a document is the text that was OCR'ed from the document. This text is fed into the search engine and is used for matching tags, correspondents and document types.
- Paperless-ngx also supports _custom fields_ which can be used to store additional metadata about a document.

## The Web UI
You can view the document, edit its metadata, assign tags, correspondents, document types, and custom fields. You can also view the document history, download the document or share it via a share link.
### Document File Versions
Think of versions as **file history** for a document.
- Versions track the underlying file and extracted text content (OCR/text).
- Metadata such as tags, correspondent, document type, storage path and custom fields stay on the "root" document.
- Version files follow normal filename formatting (including storage paths) and add a `_vN` suffix (for example `_v1`, `_v2`).
- By default, search and document content use the latest version.
- In document detail, selecting a version switches the preview, file metadata, content and download buttons to that version.
- Deleting a non-root version keeps metadata and falls back to the latest remaining version.
### Management Lists

Paperless-ngx includes management lists for tags, correspondents, document types and more. These areas allow you to view, add, edit, delete and manage permissions for these objects. You can also manage saved views, mail accounts, mail rules, workflows and more from the management sections.
### Nested Tags
Paperless-ngx v2.19 introduces support for nested tags, allowing you to create a
hierarchy of tags, which may be useful for organizing your documents. Tags can
have a 'parent' tag, creating a tree-like structure, to a maximum depth of 5. When
a tag is added to a document, all of its parent tags are also added automatically
and similarly, when a tag is removed from a document, all of its child tags are
also removed. Additionally, assigning a parent to an existing tag will automatically
update all documents that have this tag assigned, adding the parent tag as well.
## Adding documents to Paperless-ngx

Once you've got Paperless set up, you need to start feeding documents
The actions all ensure that the same mail is not consumed twice by different means. These are as follows:

- **Delete:** Immediately deletes mail that paperless has consumed documents from. Use with caution.
- **Mark as read:** Mark consumed mail as read. Paperless will not consume documents from already read mails. If you read a mail before paperless sees it, it will be ignored.
- **Flag:** Sets the 'important' flag on mails with consumed documents. Paperless will not consume flagged mails.
- **Move to folder:** Moves consumed mails out of the way so that paperless won't consume them again.
- **Add custom Tag:** Adds a custom tag to mails with consumed documents (the IMAP standard calls these "keywords"). Paperless will not consume mails already tagged. Not all mail servers support this feature!
- **Apple Mail support:** Apple Mail clients allow differently colored tags. For this to work use `apple:<color>` (e.g. _apple:green_) as a custom tag. Available colors are _red_, _orange_, _yellow_, _blue_, _green_, _violet_ and _grey_.

!!! warning

Paperless is set up to check your mails every 10 minutes. This can be configured via [`PAPERLESS_EMAIL_TASK_CRON`](configuration.md#PAPERLESS_EMAIL_TASK_CRON)
#### Processed Mail
Paperless keeps track of emails it has processed in order to avoid processing the same mail multiple times. This uses the message `UID` provided by the mail server, which should be unique for each message. You can view and manage processed mails from the web UI under Mail > Processed Mails. If you need to re-process a message, you can delete the corresponding processed mail entry, which will allow Paperless-ngx to process the email again the next time the mail fetch task runs.
#### OAuth Email Setup

Paperless-ngx supports OAuth2 authentication for Gmail and Outlook email accounts. To set up an email account with OAuth2, you will need to create a 'developer' app with the respective provider and obtain the client ID and client secret and set the appropriate [configuration variables](configuration.md#email_oauth). You will also need to set either [`PAPERLESS_OAUTH_CALLBACK_BASE_URL`](configuration.md#PAPERLESS_OAUTH_CALLBACK_BASE_URL) or [`PAPERLESS_URL`](configuration.md#PAPERLESS_URL) to the correct value for the OAuth2 flow to work correctly.
You can also submit a document using the REST API, see [POSTing documents](api.md#file-uploads) for details.
## Document Suggestions
Paperless-ngx can suggest tags, correspondents, document types and storage paths for documents based on the content of the document. This is done using a (non-LLM) machine learning model that is trained on the documents in your database. The suggestions are shown in the document detail page and can be accepted or rejected by the user.
## AI Features
Paperless-ngx includes several features that use AI to enhance the document management experience. These features are optional and can be enabled or disabled in the settings. If you are using the AI features, you may want to also enable the "LLM index" feature, which supports Retrieval-Augmented Generation (RAG) designed to improve the quality of AI responses. The LLM index feature is not enabled by default and requires additional configuration.
!!! warning

    Remember that Paperless-ngx will send document content to the AI provider you have configured, so consider the privacy implications of using these features, especially if using a remote model (e.g. OpenAI), instead of the default local model.
The AI features work by creating an embedding of the text content and metadata of documents, which is then used for various tasks such as similarity search and question answering. This uses the FAISS vector store.
### AI-Enhanced Suggestions
If enabled, Paperless-ngx can use an AI LLM model to suggest document titles, dates, tags, correspondents and document types for documents. This feature will always be "opt-in" and does not disable the existing classifier-based suggestion system. Currently, both remote (via the OpenAI API) and local (via Ollama) models are supported, see [configuration](configuration.md#ai) for details.
### Document Chat
Paperless-ngx can use an AI LLM model to answer questions about a document or across multiple documents. Again, this feature works best when RAG is enabled. The chat feature is available in the upper app toolbar and will switch between chatting across multiple documents or a single document based on the current view.
## Sharing documents from Paperless-ngx

Paperless-ngx supports sharing documents with other users by assigning them [permissions](#object-permissions) or using [email](#workflow-action-email) or [webhook](#workflow-action-webhook)

### Share Links

"Share links" are public links to files (or an archive of files) and can be created and managed under the 'Send' button on the document detail screen or from the bulk editor.

- Share links do not require a user to login and thus link directly to a file or bundled download.
- Links are unique and are of the form `{paperless-url}/share/{randomly-generated-slug}`.
- Links can optionally have an expiration time set.
- After a link expires or is deleted users will be redirected to the regular paperless-ngx login.
- From the document detail screen you can create a share link for that single document.
- From the bulk editor you can create a **share link bundle** for any selection. Paperless-ngx prepares a ZIP archive in the background and exposes a single share link. You can revisit the "Manage share link bundles" dialog to monitor progress, retry failed bundles, or delete links.

!!! tip
Superusers can access all parts of the front and backend application as well as any and all objects. Superuser status can only be granted by another superuser.
!!! tip

    Because superuser accounts can see all objects and documents, you may want to use a regular account for day-to-day use. Additional superuser accounts can be created via [cli](administration.md#create-superuser) or granted superuser status from an existing superuser account.
#### Admin Status

Admin status (Django 'staff status') grants access to viewing the paperless logs and the system status dialog
#### Types {#workflow-trigger-types}

Currently, there are four events that correspond to workflow trigger 'types':

1. **Consumption Started**: _before_ a document is consumed, so events can include filters by source (mail, consumption folder or API), file path, file name, mail rule
2. **Document Added**: _after_ a document is consumed and before it is permanently stored, but the document content has been extracted and metadata such as document type, tags, etc. have been set, so these can now be used for filtering.
3. **Document Updated**: when a document is updated. Similar to 'added' events, triggers can include filtering by content matching, tags, doc type, correspondent or storage path.
4. **Scheduled**: a scheduled trigger that can be used to run workflows at a specific time. The date used can be either the document added, created, updated date or you can specify a (date) custom field. You can also specify a day offset from the date (positive offsets will trigger after the date, negative offsets will trigger before).

The following flow diagram illustrates the four document trigger types:

```mermaid
flowchart TD
updated{"Documents
matching
'Updated'
trigger(s)"}
scheduled{"Documents
matching
trigger(s)"}
A[New Document] --> consumption
consumption --> |Yes| C[Workflow Actions Run]
consumption --> |No| D
updated --> |Yes| J[Workflow Actions Run]
updated --> |No| K
J --> K[Document Saved]
L[Scheduled Task Check<br/>hourly at :05] --> M[Get All Scheduled Triggers]
M --> scheduled
scheduled --> |Yes| N[Workflow Actions Run]
scheduled --> |No| O[Document Saved]
N --> O
```
#### Filters {#workflow-trigger-filters}
#### Filters {#workflow-trigger-filters}
Workflows allow you to filter by:
Workflows allow you to filter by:
- Source, e.g. documents uploaded via consume folder, API (& the web UI) and mail fetch
- Source, e.g. documents uploaded via consume folder, API (& the web UI) and mail fetch
- File name, including wildcards e.g. \*.pdf will apply to all pdfs
- File name, including wildcards e.g. \*.pdf will apply to all pdfs.
- File path, including wildcards. Note that enabling `PAPERLESS_CONSUMER_RECURSIVE` would allow, for
- File path, including wildcards. Note that enabling `PAPERLESS_CONSUMER_RECURSIVE` would allow, for
example, automatically assigning documents to different owners based on the upload directory.
example, automatically assigning documents to different owners based on the upload directory.
- Mail rule. Choosing this option will force 'mail fetch' to be the workflow source.
- Content matching (`Added`, `Updated` and `Scheduled` triggers only). Filter document content using the matching settings.
There are also 'advanced' filters available for `Added`, `Updated` and `Scheduled` triggers:
- Any Tags: Filter for documents with any of the specified tags.
- All Tags: Filter for documents with all of the specified tags.
- No Tags: Filter for documents with none of the specified tags.
- Document type: Filter documents with this document type.
- Not Document types: Filter documents without any of these document types.
- Correspondent: Filter documents with this correspondent.
- Not Correspondents: Filter documents without any of these correspondents.
- Storage path: Filter documents with this storage path.
- Not Storage paths: Filter documents without any of these storage paths.
- Custom field query: Filter documents with a custom field query (the same as used for the document list filters).
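As an illustration of the wildcard semantics used by the file name filter above (a sketch only, not the actual Paperless-ngx implementation), shell-style patterns such as `*.pdf` can be demonstrated with Python's `fnmatch` module:

```python
# Illustrative sketch: shell-style wildcard matching as described for the
# "File name" workflow filter (e.g. "*.pdf" applies to all PDFs).
# This is not Paperless-ngx code; the file names are hypothetical.
from fnmatch import fnmatch

filenames = ["invoice_2024.pdf", "scan.jpeg", "taxes/2023/return.pdf"]

matches = [name for name in filenames if fnmatch(name, "*.pdf")]
print(matches)  # only the two .pdf entries match
```

Note that `fnmatch`'s `*` also crosses `/` separators, which is why the nested path matches here.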
### Workflow Actions
The following workflow action types are available:
"Assignment" actions can assign:
- Title, see [workflow placeholders](usage.md#workflow-placeholders) below
- Tags, correspondent, document type and storage path
- Document owner
- View and / or edit permissions to users or groups
- Custom fields. Note that no value for the field will be set
##### Removal {#workflow-action-removal}
"Removal" actions can remove either all of or specific sets of the following:
- Tags, correspondents, document types or storage paths
- Document owner
- View and / or edit permissions
- Custom fields
##### Email {#workflow-action-email}
"Email" actions can send documents via email. This action requires a mail server to be [configured](configuration.md#email-sending). You can specify:
- The recipient email address(es) separated by commas
- The subject and body of the email, which can include placeholders, see [placeholders](usage.md#workflow-placeholders) below
- Whether to include the document as an attachment
##### Webhook {#workflow-action-webhook}
"Webhook" actions send a POST request to a specified URL. You can specify:
- The URL to send the request to
- The request body as text or as key-value pairs, which can include placeholders, see [placeholders](usage.md#workflow-placeholders) below.
- Encoding for the request body, either JSON or form data
- The request headers as key-value pairs
For security reasons, webhooks can be limited to specific ports and disallowed from connecting to local URLs. See the relevant
[configuration settings](configuration.md#workflow-webhooks) to change this behavior. If you are allowing non-admins to create workflows,
you may want to adjust these settings to prevent abuse.
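To make the two body encodings concrete, here is a minimal sketch (not Paperless-ngx code; the payload keys are hypothetical, not a documented schema) of what the same key-value pairs look like as JSON versus form data:

```python
# Sketch of the two webhook body encodings described above.
# The keys and values are hypothetical examples.
import json
from urllib.parse import urlencode

pairs = {
    "doc_title": "Invoice 2024-01",
    "doc_url": "https://paperless.example.com/documents/1/",
}

json_body = json.dumps(pairs)  # typically sent as Content-Type: application/json
form_body = urlencode(pairs)   # typically sent as application/x-www-form-urlencoded

print(json_body)
print(form_body)
```

The JSON encoding preserves types and nesting, while form data flattens everything to URL-escaped strings; which one a receiving endpoint expects depends on that endpoint.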
##### Move to Trash {#workflow-action-move-to-trash}
"Move to Trash" actions move the document to the trash. The document can be restored
from the trash until the trash is emptied (after the configured delay or manually).
The "Move to Trash" action will always be executed at the end of the workflow run,
regardless of its position in the action list. After a "Move to Trash" action is executed,
no other workflow will be executed on the document.
If a "Move to Trash" action is executed in a consume pipeline, the consumption
will be aborted and the file will be deleted.
#### Workflow placeholders
Titles and webhook payloads can be generated by workflows using [Jinja templates](https://jinja.palletsprojects.com/en/3.1.x/templates/).
This allows for complex logic to be used, including [logical structures](https://jinja.palletsprojects.com/en/3.1.x/templates/#list-of-control-structures)
and [filters](https://jinja.palletsprojects.com/en/3.1.x/templates/#id11).
The template is provided as a string.
Using Jinja2 Templates is also useful for [Date localization](advanced_usage.md#date-localization) in the title.
The available inputs differ depending on the type of workflow trigger.
This is because at the time of consumption (when the text is to be set), no automatic tags etc. have been
applied. You can use the following placeholders in the template with any trigger type:
- `{{correspondent}}`: assigned correspondent name
- `{{document_type}}`: assigned document type name
- `{{owner_username}}`: assigned owner username
- `{{added}}`: added datetime
- `{{added_year}}`: added year
- `{{added_year_short}}`: added year (short form)
- `{{added_month}}`: added month
- `{{added_month_name}}`: added month name
- `{{added_month_name_short}}`: added month short name
- `{{added_day}}`: added day
- `{{added_time}}`: added time in HH:MM format
- `{{original_filename}}`: original file name without extension
- `{{filename}}`: current file name without extension (for "added" workflows this may not be final yet, you can use `{{original_filename}}`)
- `{{doc_title}}`: current document title (cannot be used in title assignment)
The following placeholders are only available for "added" or "updated" triggers:
- `{{created}}`: created datetime
- `{{created_year}}`: created year
- `{{created_year_short}}`: created year (short form)
- `{{created_month}}`: created month
- `{{created_month_name}}`: created month name
- `{{created_month_name_short}}`: created month short name
- `{{created_day}}`: created day
- `{{created_time}}`: created time in HH:MM format
- `{{doc_url}}`: URL to the document in the web UI. Requires the `PAPERLESS_URL` setting to be set.
- `{{doc_id}}`: Document ID
##### Examples
```jinja2
{{ created | localize_date('MMMM', 'en_US') }}
<!-- Output: "January" -->
{{ added | localize_date('MMMM', 'de_DE') }}
<!-- Output: "Juni" --> # codespell:ignore
```
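As a further illustration, a title template can combine a control structure with the placeholders listed above. This is a hypothetical sketch, not a recommended template, and it assumes an "added" or "updated" trigger since it uses `{{created}}`:

```jinja2
{% if correspondent %}{{ correspondent }} - {{ created | localize_date('MMMM', 'en_US') }}{% else %}{{ original_filename }}{% endif %}
```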
### Workflow permissions
The following custom field types are supported:
- `Text`: any text
- `Boolean`: true / false (checked / unchecked) field
- `Date`: date
- `URL`: a valid url
- `Integer`: integer number e.g. 12
- `Number`: float number e.g. 12.3456
- `Monetary`: [ISO 4217 currency code](https://en.wikipedia.org/wiki/ISO_4217#List_of_ISO_4217_currency_codes) and a number with exactly two decimals, e.g. USD12.30
- `Document Link`: reference(s) to other document(s) displayed as links, automatically creates a symmetrical link in reverse
- `Select`: a pre-defined list of strings from which the user can choose
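To illustrate the `Monetary` shape described above (a sketch only, not the validation logic Paperless-ngx actually uses), a regular expression capturing "three-letter ISO 4217 code plus a number with exactly two decimals" might look like:

```python
# Illustrative check of the Monetary format described above.
# Not Paperless-ngx code; the optional leading minus is an assumption.
import re

MONETARY_RE = re.compile(r"^[A-Z]{3}-?\d+\.\d{2}$")

for value in ["USD12.30", "EUR0.99", "USD12.3", "12.30"]:
    print(value, bool(MONETARY_RE.match(value)))
```

`USD12.3` fails the exactly-two-decimals rule and `12.30` lacks a currency code, so only the first two values match.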
## PDF Actions
Paperless-ngx supports basic editing operations for PDFs (these operations currently cannot be performed on non-PDF files). When viewing an individual document you can
open the 'PDF Editor' to use a simple UI for re-arranging, rotating, deleting pages and splitting documents.
- Merging documents: available when selecting multiple documents for 'bulk editing'.
- Rotating documents: available when selecting multiple documents for 'bulk editing' and via the pdf editor on an individual document's details page.
- Splitting documents: via the pdf editor on an individual document's details page.
- Deleting pages: via the pdf editor on an individual document's details page.
- Re-arranging pages: via the pdf editor on an individual document's details page.
!!! important
You can set how long documents remain in the trash before being automatically deleted with [`PAPERLESS_EMPTY_TRASH_DELAY`](configuration.md#PAPERLESS_EMPTY_TRASH_DELAY), which defaults
to 30 days. Until the file is actually deleted (e.g. the trash is emptied), all files and database content remains intact and can be restored at any point up until that time.
Additionally you may configure a directory where deleted files are moved to when the trash is emptied with [`PAPERLESS_EMPTY_TRASH_DIR`](configuration.md#PAPERLESS_EMPTY_TRASH_DIR).
Note that the empty trash directory only stores the original file; the archive file and all database information are permanently removed once a document is fully deleted.
## Best practices {#basic-searching}
Here are a couple examples of tags and types that you could use in your
collection.
- An `inbox` tag for newly added documents that you haven't manually
  edited yet.
- A tag `car` for everything car related (repairs, registration,
  insurance, etc)
- A tag `todo` for documents that you still need to do something with,
  such as reply, or perform some task online.
- A tag `bank account x` for all bank statements related to that
  account.
- A tag `mail` for anything that you added to paperless via its mail
  processing capabilities.
- A tag `missing_metadata` when you still need to add some metadata to
  a document, but can't or don't want to do this right now.
## Searching {#basic-usage_searching}
### Preparations in paperless
- Create an inbox tag that gets assigned to all new documents.
- Create a TODO tag.
### Processing of the physical documents
Some documents require attention and require you to act on the document.
You may take two different approaches to handle these documents based on
how regularly you intend to scan documents and use paperless.
- If you scan and process your documents in paperless regularly,
  assign a TODO tag to all scanned documents that you need to process.
  Create a saved view on the dashboard that shows all documents with
  this tag.
- If you do not scan documents regularly and use paperless solely for
  archiving, create a physical todo box next to your physical inbox
  and put documents you need to process in the TODO box. When you
  have performed the task associated with the document, move it to the
  inbox.
## Remote OCR
!!! important
    This feature is disabled by default and will always remain strictly "opt-in".
Paperless-ngx supports performing OCR on documents using remote services. At the moment, this is limited to
Microsoft's Azure "Document Intelligence" service.