mirror of https://github.com/paperless-ngx/paperless-ngx.git synced 2026-05-03 13:15:25 +00:00


Trenton H 2cbe6ae892 Feature: Convert remote AI parser to plugin system (#12334)

* Refactor: move remote parser, test, and sample to paperless.parsers

Relocates three files to their new homes in the parser plugin system:

- src/paperless_remote/parsers.py
    → src/paperless/parsers/remote.py
- src/paperless_remote/tests/test_parser.py
    → src/paperless/tests/parsers/test_remote_parser.py
- src/paperless_remote/tests/samples/simple-digital.pdf
    → src/paperless/tests/samples/remote/simple-digital.pdf

Content and imports will be updated in the follow-up commit that
rewrites the parser to the new ParserProtocol interface.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Feature: migrate RemoteDocumentParser to ParserProtocol interface

Rewrites the remote OCR parser to the new plugin system contract:

- `supported_mime_types()` is now a classmethod that always returns the
  full set of 7 MIME types; the old instance-method hack (returning {}
  when unconfigured) is removed
- `score()` classmethod returns None when no remote engine is configured
  (making the parser invisible to the registry), and 20 when active —
  higher than the tesseract default of 10 so the remote engine takes
  priority when both are available
- No longer inherits from RasterisedDocumentParser; inherits no parser
  class at all — just implements the protocol directly
- `can_produce_archive = True`; `requires_pdf_rendition = False`
- `_azure_ai_vision_parse()` takes explicit config arg; API client
  created and closed within the method
- `get_page_count()` returns the PDF page count for application/pdf,
  delegating to the new `get_page_count_for_pdf()` utility
- `extract_metadata()` delegates to `extract_pdf_metadata()` for PDFs;
  returns [] for all other MIME types
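
The scoring contract described in the list above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the class names, the `engine_configured` flag, the `score()` signature, and `pick_parser()` are all hypothetical, and only three of the seven MIME types are shown (the ones named elsewhere in this commit message).

```python
from __future__ import annotations

from typing import ClassVar


class SketchTesseractParser:
    """Hypothetical stand-in for the default OCR parser."""

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        return {"application/pdf": ".pdf", "image/png": ".png"}

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        return 10  # the default score mentioned in the commit message


class SketchRemoteParser:
    """Hypothetical stand-in for the remote parser."""

    # Assumption: stands in for the real "is a remote engine configured?" check.
    engine_configured: ClassVar[bool] = False
    can_produce_archive: ClassVar[bool] = True
    requires_pdf_rendition: ClassVar[bool] = False

    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        # Always the full set, whether or not an engine is configured
        # (illustrative subset only).
        return {
            "application/pdf": ".pdf",
            "image/png": ".png",
            "image/jpeg": ".jpg",
        }

    @classmethod
    def score(cls, mime_type: str) -> int | None:
        # None hides the parser from the registry; 20 outranks the default 10.
        return 20 if cls.engine_configured else None


def pick_parser(parsers, mime_type: str):
    """Return the highest-scoring parser that supports mime_type, or None."""
    candidates = []
    for parser in parsers:
        if mime_type not in parser.supported_mime_types():
            continue
        score = parser.score(mime_type)
        if score is not None:
            candidates.append((score, parser))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

With `engine_configured` left at `False`, `pick_parser()` falls back to the lower-scoring parser; setting it to `True` makes the remote parser win for the MIME types both support.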

New files:
- `src/paperless/parsers/utils.py` — shared `extract_pdf_metadata()` and
  `get_page_count_for_pdf()` utilities (pikepdf-based); both the remote
  and tesseract parsers will use these going forward
- `src/paperless/tests/parsers/test_remote_parser.py` — 42 pytest-style
  tests using pytest-django `settings` and pytest-mock `mocker` fixtures
- `src/paperless/tests/parsers/conftest.py` — remote parser instance,
  sample-file, and settings-helper fixtures

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: use fixture factory and usefixtures in remote parser tests

- `_make_azure_mock` helper promoted to `make_azure_mock` factory fixture
  in conftest.py; tests call `make_azure_mock()` or
  `make_azure_mock("custom text")` instead of a module-level function
- `azure_settings` and `no_engine_settings` applied via
  `@pytest.mark.usefixtures` wherever their value is not referenced
  inside the test body; `TestRemoteParserParseError` marked at the class
  level since all three tests need the same setting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: improve remote parser test fixture structure

- make_azure_mock moved from conftest.py back into test_remote_parser.py;
  it is specific to that module and does not belong in shared fixtures
- azure_client fixture composes azure_settings + make_azure_mock + patch
  in one step; tests no longer repeat the mocker.patch call or carry an
  unused azure_settings parameter
- failing_azure_client fixture similarly composes azure_settings + patch
  with a RuntimeError side effect; TestRemoteParserParseError now only
  receives the mock it actually uses
- All @pytest.mark.parametrize calls use pytest.param with explicit ids
  (pdf, png, jpeg, ...) for readable test output

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: wire RemoteDocumentParser into consumer and fix signals

- paperless_remote/signals.py: import from paperless.parsers.remote
  (new location after git mv). supported_mime_types() is now a
  classmethod that always returns the full set, so get_supported_mime_types()
  in the signal layer explicitly checks RemoteEngineConfig validity and
  returns {} when unconfigured — preserving the old behaviour where an
  unconfigured remote parser does not register for any MIME types.

- documents/consumer.py: extend the _parser_cleanup() shim, parse()
  dispatch, and get_thumbnail() dispatch to include RemoteDocumentParser
  alongside TextDocumentParser. Both new-style parsers use __exit__
  for cleanup and take (document_path, mime_type) without a file_name
  argument.
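
The consumer-side dispatch described above might look something like this sketch. Every name here is hypothetical; the legacy interface (a `parse()` that also takes `file_name`, plus a separate `cleanup()` method) is an assumption made for contrast, and the real shim in documents/consumer.py may be structured differently.

```python
from __future__ import annotations

from pathlib import Path


class SketchNewStyleParser:
    """Hypothetical new-style parser: context manager, no file_name argument."""

    def __init__(self) -> None:
        self.text: str | None = None
        self.cleaned_up = False

    def __enter__(self) -> SketchNewStyleParser:
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Cleanup lives here rather than in a separate cleanup() method.
        self.cleaned_up = True

    def parse(self, document_path: Path, mime_type: str) -> None:
        self.text = f"parsed {document_path.name} as {mime_type}"


class SketchLegacyParser:
    """Hypothetical old-style parser (assumed interface)."""

    def __init__(self) -> None:
        self.text: str | None = None
        self.cleaned_up = False

    def parse(self, document_path: Path, mime_type: str, file_name: str | None = None) -> None:
        self.text = f"parsed {file_name or document_path.name} as {mime_type}"

    def cleanup(self) -> None:
        self.cleaned_up = True


# Parsers that follow the new calling convention.
NEW_STYLE_PARSERS = (SketchNewStyleParser,)


def run_parse(parser, document_path: Path, mime_type: str, file_name: str) -> str | None:
    """Call parse() with the right arguments, then clean up the right way."""
    new_style = isinstance(parser, NEW_STYLE_PARSERS)
    try:
        if new_style:
            parser.parse(document_path, mime_type)
        else:
            parser.parse(document_path, mime_type, file_name)
        return parser.text
    finally:
        if new_style:
            parser.__exit__(None, None, None)
        else:
            parser.cleanup()
```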

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Refactor: fix type errors in remote parser and signals

- remote.py: add `if TYPE_CHECKING: assert` guards before the Azure
  client construction to narrow config.endpoint and config.api_key from
  str|None to str. The narrowing is safe: engine_is_valid() guarantees
  both are non-None when it returns True (api_key explicitly; endpoint
  via `not (engine=="azureai" and endpoint is None)` for the only valid
  engine). Asserts are wrapped in TYPE_CHECKING so they carry zero
  runtime cost.

- signals.py: add full type annotations — return types, Any-typed
  sender parameter, and explicit logging_group argument replacing *args.
  Add `from __future__ import annotations` for consistent annotation style.
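
As a rough illustration of the `TYPE_CHECKING` narrowing pattern described in the first item, assuming a simplified config object: `SketchEngineConfig` and `connection_args()` are hypothetical names, and the validity rule is paraphrased from the description above rather than copied from the project.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING


@dataclass
class SketchEngineConfig:
    """Hypothetical, simplified stand-in for the remote engine config."""

    engine: str | None = None
    api_key: str | None = None
    endpoint: str | None = None

    def engine_is_valid(self) -> bool:
        # Paraphrase of the rule in the commit message: the only valid
        # engine needs both an API key and an endpoint.
        return (
            self.engine == "azureai"
            and self.api_key is not None
            and not (self.engine == "azureai" and self.endpoint is None)
        )


def connection_args(config: SketchEngineConfig) -> tuple[str, str]:
    """Return (endpoint, api_key) once the config is known to be valid."""
    if not config.engine_is_valid():
        raise ValueError("remote engine is not configured")
    if TYPE_CHECKING:
        # Only the type checker sees these asserts: they narrow
        # str | None to str and are never executed at runtime.
        assert config.endpoint is not None
        assert config.api_key is not None
    return config.endpoint, config.api_key
```

The runtime guarantee comes from `engine_is_valid()`; the asserts only tell the type checker what that check already implies.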

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: get_parser factory forwards logging_group, drops progress_callback

consumer.py calls parser_class(logging_group, progress_callback=...).
RemoteDocumentParser.__init__ accepts logging_group but not
progress_callback, so only the latter is dropped — matching the pattern
established by the TextDocumentParser signals shim.
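
A minimal sketch of such a factory, assuming a parser whose `__init__` accepts only `logging_group`. `SketchParser` and this `get_parser()` signature are illustrative and not the project's actual shim.

```python
from __future__ import annotations

from typing import Any


class SketchParser:
    """Hypothetical parser whose __init__ accepts logging_group only."""

    def __init__(self, logging_group: object = None) -> None:
        self.logging_group = logging_group


def get_parser(logging_group: object = None, *, progress_callback: Any = None) -> SketchParser:
    # The caller passes both arguments; only logging_group is forwarded,
    # because the parser's __init__ has no progress_callback parameter.
    return SketchParser(logging_group)
```

A call such as `get_parser("group-1", progress_callback=print)` then succeeds and keeps the logging group, where passing `progress_callback` straight to `SketchParser` would raise a `TypeError`.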

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* Fix: text parser get_parser forwards logging_group, drops progress_callback

TextDocumentParser.__init__ accepts logging_group: object = None, same
as RemoteDocumentParser. The old shim incorrectly dropped it; fix to
forward it as a positional arg and only drop progress_callback.
Add type annotations and from __future__ import annotations for
consistency with the remote parser signals shim.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

2026-03-18 16:19:46 -07:00

.devcontainer

Breaking: Remove pyzbar as a barcode reader (#12065)

2026-02-13 08:14:00 -08:00

.github

Chore(deps): Bump the actions group with 2 updates (#12377)

2026-03-18 06:18:11 +00:00

docker

Feature: Convert Tika parser to the plugin system (#12333)

2026-03-17 15:43:28 -07:00

docs

Security: validate outbound llm URLs and block internal endpoints

2026-03-16 22:58:16 -07:00

resources

New -ngx logo 2022

2022-02-26 20:14:24 -08:00

scripts

Chore: Remove some further old items (encryption passphrase and PNG handling) (#12290)

2026-03-09 22:04:51 +00:00

src

Feature: Convert remote AI parser to plugin system (#12334)

2026-03-18 16:19:46 -07:00

src-ui

Chore(deps-dev): Bump the frontend-jest-dependencies group (#12374)

2026-03-18 01:15:52 +00:00

.codecov.yml

Breaking: Drop support for Python 3.10 (#12234)

2026-03-04 15:03:33 -08:00

.dockerignore

Chore: Enable mypy checking in CI (#11991)

2026-02-03 16:02:33 -08:00

.editorconfig

Breaking: Refactor advanced database settings to allow more user configuration (#12165)

2026-02-27 14:37:26 -08:00

.env

Chore: Remove unneeded .env entry, revert crowdin action rm, reduce frequency

2023-12-02 08:24:17 -08:00

.gitignore

Chore: move to Zensical for docs (#12011)

2026-02-07 10:58:55 -08:00

.hadolint.yml

Configure Hadolint in a single location for both hooks and CI

2022-07-19 13:54:33 -07:00

.mypy-baseline.txt

Enhancement: saved view sharing (#12142)

2026-03-04 14:15:43 -08:00

.pre-commit-config.yaml

Chore(deps): Bump https://github.com/astral-sh/ruff-pre-commit (#12371)

2026-03-18 06:25:40 +00:00

.prettierrc.js

Chore(deps): Bump the pre-commit-dependencies group with 4 updates (#12323)

2026-03-12 16:29:57 +00:00

.pyrefly-baseline.json

Chore: Configure pyrefly as an alternative typing tool (#12003)

2026-02-07 10:33:00 -08:00

.yamlfmt

Chore(deps): Bump bootstrap from 5.3.7 to 5.3.8 in /src-ui (#10740)

2025-09-03 21:58:53 +00:00

CODE_OF_CONDUCT.md

Chore(deps-dev): Bump the development group across 1 directory with 2 updates (#6851)

2024-05-29 07:04:01 +00:00

CODEOWNERS

Chore: Switch from pipenv to uv (#9251)

2025-03-04 16:15:51 +00:00

CONTRIBUTING.md

Breaking: Drop support for Python 3.10 (#12234)

2026-03-04 15:03:33 -08:00

crowdin.yml

Chore: Implement crowdin GHA (#4706)

2023-12-01 17:44:33 -08:00

Dockerfile

docker(deps): bump astral-sh/uv (#12265)

2026-03-10 17:27:06 +00:00

install-paperless-ngx.sh

Chore: fix Postgres compose volume mount path in install script (#11184)

2025-10-26 14:40:37 +00:00

LICENSE

Initial commit

2015-12-20 12:54:28 +00:00

paperless-ngx.code-workspace

Chore: Enables pylance pytest integration, swaps around some test markers (#11930)

2026-01-28 23:06:11 +00:00

paperless.conf.example

Feature: support split documents based on tag barcodes (#11645)

2026-01-29 08:05:33 -08:00

pyproject.toml

Chore(deps): Update django-allauth[mfa,socialaccount] requirement (#12381)

2026-03-18 03:55:03 +00:00

README.md

Documentation: update crowdin links (#9595)

2025-04-09 08:01:21 -07:00

SECURITY.md

Create SECURITY.md

2024-02-15 23:38:33 -08:00

uv.lock

Bumps zensical to 0.0.26 to resolve the wheel building it tries to do (#12392)

2026-03-18 22:53:34 +00:00

zensical.toml

Breaking: Refactor advanced database settings to allow more user configuration (#12165)

2026-02-27 14:37:26 -08:00

README.md

Paperless-ngx

Paperless-ngx is a document management system that transforms your physical documents into a searchable online archive so you can keep, well, less paper.

Paperless-ngx is the official successor to the original Paperless & Paperless-ng projects and is designed to distribute the responsibility of advancing and supporting the project among a team of people. Consider joining us!

Thanks to the generous folks at DigitalOcean, a demo is available at demo.paperless-ngx.com using login demo / demo. Note: demo content is reset frequently and confidential information should not be uploaded.

Features
Getting started
Contributing
Related Projects
Important Note

This project is supported by:

Features

A full list of features, with screenshots, is available in the documentation.

Getting started

The easiest way to deploy Paperless-ngx is with Docker Compose. The files in the /docker/compose directory are configured to pull the image from the GitHub container registry.

If you'd like to jump right in, you can configure a docker compose environment with our install script:

bash -c "$(curl -L https://raw.githubusercontent.com/paperless-ngx/paperless-ngx/main/install-paperless-ngx.sh)"

More details and step-by-step guides for alternative installation methods can be found in the documentation.

Migrating from Paperless-ng is easy: just drop in the new Docker image! See the documentation on migrating for more details.

Documentation

The documentation for Paperless-ngx is available at https://docs.paperless-ngx.com.

Contributing

If you feel like contributing to the project, please do! Bug fixes, enhancements, visual fixes, etc. are always welcome. If you want to implement something big, please start a discussion about it first. The documentation has some basic information on how to get started.

Community Support

People interested in continuing the work on Paperless-ngx are encouraged to reach out here on GitHub and in the Matrix Room. If you would like to contribute to the project on an ongoing basis, there are multiple teams (frontend, CI/CD, etc.) that could use your help, so please reach out!

Translation

Paperless-ngx is available in many languages, with translations coordinated on Crowdin. If you want to help out by translating Paperless-ngx into your language, please head over to https://crowdin.com/project/paperless-ngx, and thank you! More details can be found in CONTRIBUTING.md.

Feature Requests

Feature requests can be submitted via GitHub Discussions, where you can search for existing ideas, add your own, and vote for the ones you care about.

Bugs

For bugs, please open an issue. If you have questions, start a discussion.

Related Projects

Please see the wiki for a user-maintained list of related projects and software that is compatible with Paperless-ngx.

Important Note

Document scanners are typically used to scan sensitive documents like your social insurance number, tax records, invoices, etc. Paperless-ngx should never be run on an untrusted host because information is stored in clear text without encryption. No guarantees are made regarding security (but we do try!) and you use the app at your own risk. The safest way to run Paperless-ngx is on a local server in your own home with backups in place.

Languages

PostScript 71.7%

Python 15.7%

TypeScript 9.7%

HTML 2.4%

SCSS 0.3%
