From 2bb7c7ae1708241567421e27bf608a3d7e2b2d21 Mon Sep 17 00:00:00 2001
From: Trenton H <797416+stumpylog@users.noreply.github.com>
Date: Tue, 31 Mar 2026 09:16:43 -0700
Subject: [PATCH] Chore: Document the parser plugin system (#12423)

Co-authored-by: shamoon <4887959+shamoon@users.noreply.github.com>
---
 .github/ISSUE_TEMPLATE/bug-report.yml |   1 +
 docs/advanced_usage.md                |  75 +++++
 docs/development.md                   | 457 ++++++++++++++++++++------
 3 files changed, 441 insertions(+), 92 deletions(-)

diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml
index b6baf49bf..e87c3e0c6 100644
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -21,6 +21,7 @@ body:
         - [The installation instructions](https://docs.paperless-ngx.com/setup/#installation).
         - [Existing issues and discussions](https://github.com/paperless-ngx/paperless-ngx/search?q=&type=issues).
         - Disable any custom container initialization scripts, if using
+        - Remove any third-party parser plugins — issues caused by or requiring changes to a third-party plugin will be closed without investigation.
 
         If you encounter issues while installing or configuring Paperless-ngx, please post in the ["Support" section of the discussions](https://github.com/paperless-ngx/paperless-ngx/discussions/new?category=support).
   - type: textarea
diff --git a/docs/advanced_usage.md b/docs/advanced_usage.md
index 989604000..ee0cfddce 100644
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -723,6 +723,81 @@ services:
 
 1. Note the `:ro` tag means the folder will be mounted as read only. This is for extra security against changes
 
+## Installing third-party parser plugins {#parser-plugins}
+
+Third-party parser plugins extend Paperless-ngx to support additional file
+formats. A plugin is a Python package that advertises itself under the
+`paperless_ngx.parsers` entry point group. Refer to the
+[developer documentation](development.md#making-custom-parsers) for how to
+create one.
+
+!!! warning "Third-party plugins are not officially supported"
+
+    The Paperless-ngx maintainers do not provide support for third-party
+    plugins. Issues caused by or requiring changes to a third-party plugin
+    will be closed without further investigation. Always reproduce problems
+    with all plugins removed before filing a bug report.
+
+### Docker
+
+Use a [custom container initialization script](#custom-container-initialization)
+to install the package before the webserver starts. Create a shell script and
+mount it into `/custom-cont-init.d`:
+
+```bash
+#!/bin/bash
+# /path/to/my/scripts/install-parsers.sh
+
+pip install my-paperless-parser-package
+```
+
+Mount it in your `docker-compose.yml`:
+
+```yaml
+services:
+  webserver:
+    # ...
+    volumes:
+      - /path/to/my/scripts:/custom-cont-init.d:ro
+```
+
+The script runs as `root` before the webserver starts, so the package will be
+available when Paperless-ngx discovers plugins at startup.
+
+### Bare metal
+
+Install the package into the same Python environment that runs Paperless-ngx.
+If you followed the standard bare-metal install guide, that is the `paperless`
+user's environment:
+
+```bash
+sudo -Hu paperless pip3 install my-paperless-parser-package
+```
+
+If you are using `uv` or a virtual environment, activate it first and then run:
+
+```bash
+uv pip install my-paperless-parser-package
+# or
+pip install my-paperless-parser-package
+```
+
+Restart all Paperless-ngx services after installation so the new plugin is
+discovered.
+
+### Verifying installation
+
+On the next startup, check the application logs for a line confirming
+discovery:
+
+```
+Loaded third-party parser 'My Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
+```
+
+If this line does not appear, verify that the package is installed in the
+correct environment and that its `pyproject.toml` declares the
+`paperless_ngx.parsers` entry point.
+
 ## MySQL Caveats {#mysql-caveats}
 
 ### Case Sensitivity
diff --git a/docs/development.md b/docs/development.md
index e6b9955e8..11e078a67 100644
--- a/docs/development.md
+++ b/docs/development.md
@@ -370,121 +370,367 @@ docker build --file Dockerfile --tag paperless:local .
 
 ## Extending Paperless-ngx
 
-Paperless-ngx does not have any fancy plugin systems and will probably never
-have. However, some parts of the application have been designed to allow
-easy integration of additional features without any modification to the
-base code.
+Paperless-ngx supports third-party document parsers via a Python entry point
+plugin system. Plugins are distributed as ordinary Python packages and
+discovered automatically at startup — no changes to the Paperless-ngx source
+are required.
+
+!!! warning "Third-party plugins are not officially supported"
+
+    The Paperless-ngx maintainers do not provide support for third-party
+    plugins. Issues that are caused by or require changes to a third-party
+    plugin will be closed without further investigation. If you believe you
+    have found a bug in Paperless-ngx itself (not in a plugin), please
+    reproduce it with all third-party plugins removed before filing an issue.
 
 ### Making custom parsers
 
-Paperless-ngx uses parsers to add documents. A parser is
-responsible for:
+Paperless-ngx uses parsers to add documents. A parser is responsible for:
 
-- Retrieving the content from the original
-- Creating a thumbnail
-- _optional:_ Retrieving a created date from the original
-- _optional:_ Creating an archived document from the original
+- Extracting plain-text content from the document
+- Generating a thumbnail image
+- _optional:_ Detecting the document's creation date
+- _optional:_ Producing a searchable PDF archive copy
 
-Custom parsers can be added to Paperless-ngx to support more file types. In
-order to do that, you need to write the parser itself and announce its
-existence to Paperless-ngx.
+Custom parsers are distributed as ordinary Python packages and registered
+via a [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
+No changes to the Paperless-ngx source are required.
 
-The parser itself must extend `documents.parsers.DocumentParser` and
-must implement the methods `parse` and `get_thumbnail`. You can provide
-your own implementation to `get_date` if you don't want to rely on
-Paperless-ngx' default date guessing mechanisms.
+#### 1. Implementing the parser class
+
+Your parser must satisfy the `ParserProtocol` structural interface defined in
+`paperless.parsers`. The simplest approach is to write a plain class — no base
+class is required, only the right attributes and methods.
+
+**Class-level identity attributes**
+
+The registry reads these before instantiating the parser, so they must be
+plain class attributes (not instance attributes or properties):
 
 ```python
-class MyCustomParser(DocumentParser):
-
-    def parse(self, document_path, mime_type):
-        # This method does not return anything. Rather, you should assign
-        # whatever you got from the document to the following fields:
-
-        # The content of the document.
-        self.text = "content"
-
-        # Optional: path to a PDF document that you created from the original.
-        self.archive_path = os.path.join(self.tempdir, "archived.pdf")
-
-        # Optional: "created" date of the document.
-        self.date = get_created_from_metadata(document_path)
-
-    def get_thumbnail(self, document_path, mime_type):
-        # This should return the path to a thumbnail you created for this
-        # document.
-        return os.path.join(self.tempdir, "thumb.webp")
+class MyCustomParser:
+    name    = "My Format Parser"   # human-readable name shown in logs
+    version = "1.0.0"              # semantic version string
+    author  = "Acme Corp"          # author / organisation
+    url     = "https://example.com/my-parser"  # docs or issue tracker
 ```
 
-If you encounter any issues during parsing, raise a
-`documents.parsers.ParseError`.
+**Declaring supported MIME types**
 
-The `self.tempdir` directory is a temporary directory that is guaranteed
-to be empty and removed after consumption finished. You can use that
-directory to store any intermediate files and also use it to store the
-thumbnail / archived document.
-
-After that, you need to announce your parser to Paperless-ngx. You need to
-connect a handler to the `document_consumer_declaration` signal. Have a
-look in the file `src/paperless_tesseract/apps.py` on how that's done.
-The handler is a method that returns information about your parser:
+Return a `dict` mapping MIME type strings to preferred file extensions
+(including the leading dot). Paperless-ngx uses the extension when storing
+archive copies and serving files for download.
 
 ```python
-def myparser_consumer_declaration(sender, **kwargs):
+@classmethod
+def supported_mime_types(cls) -> dict[str, str]:
     return {
-        "parser": MyCustomParser,
-        "weight": 0,
-        "mime_types": {
-            "application/pdf": ".pdf",
-            "image/jpeg": ".jpg",
-        }
+        "application/x-my-format": ".myf",
+        "application/x-my-format-alt": ".myf",
     }
 ```
 
-- `parser` is a reference to a class that extends `DocumentParser`.
-- `weight` is used whenever two or more parsers are able to parse a
-  file: The parser with the higher weight wins. This can be used to
-  override the parsers provided by Paperless-ngx.
-- `mime_types` is a dictionary. The keys are the mime types your
-  parser supports and the value is the default file extension that
-  Paperless-ngx should use when storing files and serving them for
-  download. We could guess that from the file extensions, but some
-  mime types have many extensions associated with them and the Python
-  methods responsible for guessing the extension do not always return
-  the same value.
+**Scoring**
 
-## Using Visual Studio Code devcontainer
+When more than one parser can handle a file, the registry calls `score()` on
+each candidate and picks the one with the highest result and equal scores favor third-party parsers over built-ins. Return `None` to
+decline handling a file even though the MIME type is listed as supported (for
+example, when a required external service is not configured).
 
-Another easy way to get started with development is to use Visual Studio
-Code devcontainers. This approach will create a preconfigured development
-environment with all of the required tools and dependencies.
-[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
-The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
-contain more information about the specific tasks and launch configurations (see the
-non-standard "description" field).
+| Score  | Meaning                                                                           |
+| ------ | --------------------------------------------------------------------------------- |
+| `None` | Decline — do not handle this file                                                 |
+| `10`   | Default priority used by all built-in parsers                                     |
+| `20`   | Priority used by the remote OCR built-in parser, allowing it to replace Tesseract |
+| `> 10` | Override a built-in parser for the same MIME type                                 |
 
-To get started:
+```python
+@classmethod
+def score(
+    cls,
+    mime_type: str,
+    filename: str,
+    path: "Path | None" = None,
+) -> int | None:
+    # Inspect filename or file bytes here if needed.
+    return 10
+```
 
-1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+**Archive and rendition flags**
 
-2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+```python
+@property
+def can_produce_archive(self) -> bool:
+    """True if parse() can produce a searchable PDF archive copy."""
+    return True   # or False if your parser doesn't produce PDFs
 
-3. In case your host operating system is Windows:
-   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
-   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
+@property
+def requires_pdf_rendition(self) -> bool:
+    """True if the original format cannot be displayed by a browser
+    (e.g. DOCX, ODT) and the PDF output must always be kept."""
+    return False
+```
 
-4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
-   will initialize the database tables and create a superuser. Then you can compile the front end
-   for production or run the frontend in debug mode.
+**Context manager — temp directory lifecycle**
 
-5. The project is ready for debugging, start either run the fullstack debug or individual debug
-   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**
+Paperless-ngx always uses parsers as context managers. Create a temporary
+working directory in `__enter__` (or `__init__`) and remove it in `__exit__`
+regardless of whether an exception occurred. Store intermediate files,
+thumbnails, and archive PDFs inside this directory.
 
-## Developing Date Parser Plugins
+```python
+import shutil
+import tempfile
+from pathlib import Path
+from typing import Self
+from types import TracebackType
+
+from django.conf import settings
+
+class MyCustomParser:
+    ...
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self._tempdir = Path(
+            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR)
+        )
+        self._text: str | None = None
+        self._archive_path: Path | None = None
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(
+        self,
+        exc_type: type[BaseException] | None,
+        exc_val: BaseException | None,
+        exc_tb: TracebackType | None,
+    ) -> None:
+        shutil.rmtree(self._tempdir, ignore_errors=True)
+```
+
+**Optional context — `configure()`**
+
+The consumer calls `configure()` with a `ParserContext` after instantiation
+and before `parse()`. If your parser doesn't need context, a no-op
+implementation is fine:
+
+```python
+from paperless.parsers import ParserContext
+
+def configure(self, context: ParserContext) -> None:
+    pass   # override if you need context.mailrule_id, etc.
+```
+
+**Parsing**
+
+`parse()` is the core method. It must not return a value; instead, store
+results in instance attributes and expose them via the accessor methods below.
+Raise `documents.parsers.ParseError` on any unrecoverable failure.
+
+```python
+from documents.parsers import ParseError
+
+def parse(
+    self,
+    document_path: Path,
+    mime_type: str,
+    *,
+    produce_archive: bool = True,
+) -> None:
+    try:
+        self._text = extract_text_from_my_format(document_path)
+    except Exception as e:
+        raise ParseError(f"Failed to parse {document_path}: {e}") from e
+
+    if produce_archive and self.can_produce_archive:
+        archive = self._tempdir / "archived.pdf"
+        convert_to_pdf(document_path, archive)
+        self._archive_path = archive
+```
+
+**Result accessors**
+
+```python
+def get_text(self) -> str | None:
+    return self._text
+
+def get_date(self) -> "datetime.datetime | None":
+    # Return a datetime extracted from the document, or None to let
+    # Paperless-ngx use its default date-guessing logic.
+    return None
+
+def get_archive_path(self) -> Path | None:
+    return self._archive_path
+
+def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+    # If the format doesn't have the concept of pages, return None
+    return count_pages(document_path)
+
+```
+
+**Thumbnail**
+
+`get_thumbnail()` may be called independently of `parse()`. Return the path
+to a WebP image inside `self._tempdir`. The image should be roughly 500 × 700
+pixels.
+
+```python
+def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
+    thumb = self._tempdir / "thumb.webp"
+    render_thumbnail(document_path, thumb)
+    return thumb
+```
+
+**Optional methods**
+
+These are called by the API on demand, not during the consumption pipeline.
+Implement them if your format supports the information; otherwise return
+`None` / `[]`.
+
+```python
+
+def extract_metadata(
+    self,
+    document_path: Path,
+    mime_type: str,
+) -> "list[MetadataEntry]":
+    # Must never raise. Return [] if metadata cannot be read.
+    from paperless.parsers import MetadataEntry
+    return [
+        MetadataEntry(
+            namespace="https://example.com/ns/",
+            prefix="ex",
+            key="Author",
+            value="Alice",
+        )
+    ]
+```
+
+#### 2. Registering via entry point
+
+Add the following to your package's `pyproject.toml`. The key (left of `=`)
+is an arbitrary name used only in log output; the value is the
+`module:ClassName` import path.
+
+```toml
+[project.entry-points."paperless_ngx.parsers"]
+my_parser = "my_package.parsers:MyCustomParser"
+```
+
+Install your package into the same Python environment as Paperless-ngx (or
+add it to the Docker image), and the parser will be discovered automatically
+on the next startup. No configuration changes are needed.
+
+To verify discovery, check the application logs at startup for a line like:
+
+```
+Loaded third-party parser 'My Format Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
+```
+
+#### 3. Utilities
+
+`paperless.parsers.utils` provides helpers you can import directly:
+
+| Function                                | Description                                                      |
+| --------------------------------------- | ---------------------------------------------------------------- |
+| `read_file_handle_unicode_errors(path)` | Read a file as UTF-8, replacing invalid bytes instead of raising |
+| `get_page_count_for_pdf(path)`          | Count pages in a PDF using pikepdf                               |
+| `extract_pdf_metadata(path)`            | Extract XMP metadata from a PDF as a `list[MetadataEntry]`       |
+
+#### Minimal example
+
+A complete, working parser for a hypothetical plain-XML format:
+
+```python
+from __future__ import annotations
+
+import shutil
+import tempfile
+from pathlib import Path
+from typing import Self
+from types import TracebackType
+import xml.etree.ElementTree as ET
+
+from django.conf import settings
+
+from documents.parsers import ParseError
+from paperless.parsers import ParserContext
+
+
+class XmlDocumentParser:
+    name    = "XML Parser"
+    version = "1.0.0"
+    author  = "Acme Corp"
+    url     = "https://example.com/xml-parser"
+
+    @classmethod
+    def supported_mime_types(cls) -> dict[str, str]:
+        return {"application/xml": ".xml", "text/xml": ".xml"}
+
+    @classmethod
+    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
+        return 10
+
+    @property
+    def can_produce_archive(self) -> bool:
+        return False
+
+    @property
+    def requires_pdf_rendition(self) -> bool:
+        return False
+
+    def __init__(self, logging_group: object = None) -> None:
+        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
+        self._tempdir = Path(tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR))
+        self._text: str | None = None
+
+    def __enter__(self) -> Self:
+        return self
+
+    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
+        shutil.rmtree(self._tempdir, ignore_errors=True)
+
+    def configure(self, context: ParserContext) -> None:
+        pass
+
+    def parse(self, document_path: Path, mime_type: str, *, produce_archive: bool = True) -> None:
+        try:
+            tree = ET.parse(document_path)
+            self._text = " ".join(tree.getroot().itertext())
+        except ET.ParseError as e:
+            raise ParseError(f"XML parse error: {e}") from e
+
+    def get_text(self) -> str | None:
+        return self._text
+
+    def get_date(self):
+        return None
+
+    def get_archive_path(self) -> Path | None:
+        return None
+
+    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
+        from PIL import Image, ImageDraw
+        img = Image.new("RGB", (500, 700), color="white")
+        ImageDraw.Draw(img).text((10, 10), "XML Document", fill="black")
+        out = self._tempdir / "thumb.webp"
+        img.save(out, format="WEBP")
+        return out
+
+    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
+        return None
+
+    def extract_metadata(self, document_path: Path, mime_type: str) -> list:
+        return []
+```
+
+### Developing date parser plugins
 
 Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
 
-### Creating a Date Parser Plugin
+#### Creating a Date Parser Plugin
 
 To create a custom date parser plugin, you need to:
 
@@ -492,7 +738,7 @@ To create a custom date parser plugin, you need to:
 2. Implement the required abstract method
 3. Register your plugin via an entry point
 
-#### 1. Implementing the Parser Class
+##### 1. Implementing the Parser Class
 
 Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:
 
@@ -532,7 +778,7 @@ class MyDateParserPlugin(DateParserPluginBase):
         yield another_datetime
 ```
 
-#### 2. Configuration and Helper Methods
+##### 2. Configuration and Helper Methods
 
 Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:
 
@@ -565,11 +811,11 @@ def _filter_date(
     """
 ```
 
-#### 3. Resource Management (Optional)
+##### 3. Resource Management (Optional)
 
 If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods. Paperless-ngx will always use plugins as context managers, ensuring resources can be released even in the event of errors.
 
-#### 4. Registering Your Plugin
+##### 4. Registering Your Plugin
 
 Register your plugin using a setuptools entry point in your package's `pyproject.toml`:
 
@@ -580,7 +826,7 @@ my_parser = "my_package.parsers:MyDateParserPlugin"
 
 The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.
 
-### Plugin Discovery
+#### Plugin Discovery
 
 Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:
 
@@ -591,7 +837,7 @@ Paperless-ngx automatically discovers and loads date parser plugins at runtime.
 
 If multiple plugins are installed, a warning is logged indicating which plugin was selected.
 
-### Example: Simple Date Parser
+#### Example: Simple Date Parser
 
 Here's a minimal example that only looks for ISO 8601 dates:
 
@@ -623,3 +869,30 @@ class ISODateParserPlugin(DateParserPluginBase):
             if filtered_date is not None:
                 yield filtered_date
 ```
+
+## Using Visual Studio Code devcontainer
+
+Another easy way to get started with development is to use Visual Studio
+Code devcontainers. This approach will create a preconfigured development
+environment with all of the required tools and dependencies.
+[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
+The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
+contain more information about the specific tasks and launch configurations (see the
+non-standard "description" field).
+
+To get started:
+
+1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+
+2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+
+3. In case your host operating system is Windows:
+   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
+   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
+
+4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
+   will initialize the database tables and create a superuser. Then you can compile the front end
+   for production or run the frontend in debug mode.
+
+5. The project is ready for debugging, start either run the fullstack debug or individual debug
+   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**