Moves the date parsing plugin section under the extending section

Adds a section about how the 2 install types can add external plugins
Inital documentation updates for developing a plugin
2026-03-21 08:25:59 +00:00 · 2026-03-20 15:05:13 -07:00 · 2026-03-20 14:55:17 -07:00 · 2026-03-20 14:48:45 -07:00
3 changed files with 439 additions and 92 deletions
--- a/.github/ISSUE_TEMPLATE/bug-report.yml
+++ b/.github/ISSUE_TEMPLATE/bug-report.yml
@@ -21,6 +21,7 @@ body:
        - [The installation instructions](https://docs.paperless-ngx.com/setup/#installation).
        - [Existing issues and discussions](https://github.com/paperless-ngx/paperless-ngx/search?q=&type=issues).
        - Disable any custom container initialization scripts, if using
        - Remove any third-party parser plugins — issues caused by or requiring changes to a third-party plugin will be closed without investigation.
        If you encounter issues while installing or configuring Paperless-ngx, please post in the ["Support" section of the discussions](https://github.com/paperless-ngx/paperless-ngx/discussions/new?category=support).
  - type: textarea
@@ -120,5 +121,7 @@ body:
          required: true
        - label: I have already searched for relevant existing issues and discussions before opening this report.
          required: true
        - label: I have reproduced this issue with all third-party parser plugins removed. I understand that issues caused by third-party plugins will be closed without investigation.
          required: true
        - label: I have updated the title field above with a concise description.
          required: true
--- a/docs/advanced_usage.md
+++ b/docs/advanced_usage.md
@@ -723,6 +723,81 @@ services:
 1. Note the `:ro` tag means the folder will be mounted as read only. This is for extra security against changes
 ## Installing third-party parser plugins {#parser-plugins}
 Third-party parser plugins extend Paperless-ngx to support additional file
 formats. A plugin is a Python package that advertises itself under the
 `paperless_ngx.parsers` entry point group. Refer to the
 [developer documentation](development.md#making-custom-parsers) for how to
 create one.
 !!! warning "Third-party plugins are not officially supported"
    The Paperless-ngx maintainers do not provide support for third-party
    plugins. Issues caused by or requiring changes to a third-party plugin
    will be closed without further investigation. Always reproduce problems
    with all plugins removed before filing a bug report.
 ### Docker
 Use a [custom container initialization script](#custom-container-initialization)
 to install the package before the webserver starts. Create a shell script and
 mount it into `/custom-cont-init.d`:
 ```bash
 #!/bin/bash
 # /path/to/my/scripts/install-parsers.sh
 pip install my-paperless-parser-package
 ```
 Mount it in your `docker-compose.yml`:
 ```yaml
 services:
  webserver:
    # ...
    volumes:
      - /path/to/my/scripts:/custom-cont-init.d:ro
 ```
 The script runs as `root` before the webserver starts, so the package will be
 available when Paperless-ngx discovers plugins at startup.
 ### Bare metal
 Install the package into the same Python environment that runs Paperless-ngx.
 If you followed the standard bare-metal install guide, that is the `paperless`
 user's environment:
 ```bash
 sudo -Hu paperless pip3 install my-paperless-parser-package
 ```
 If you are using `uv` or a virtual environment, activate it first and then run:
 ```bash
 uv pip install my-paperless-parser-package
 # or
 pip install my-paperless-parser-package
 ```
 Restart all Paperless-ngx services after installation so the new plugin is
 discovered.
 ### Verifying installation
 On the next startup, check the application logs for a line confirming
 discovery:
 ```
 Loaded third-party parser 'My Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
 ```
 If this line does not appear, verify that the package is installed in the
 correct environment and that its `pyproject.toml` declares the
 `paperless_ngx.parsers` entry point.
 ## MySQL Caveats {#mysql-caveats}
 ### Case Sensitivity
--- a/docs/development.md
+++ b/docs/development.md
@@ -370,121 +370,363 @@ docker build --file Dockerfile --tag paperless:local .
 ## Extending Paperless-ngx
-Paperless-ngx does not have any fancy plugin systems and will probably never
+Paperless-ngx supports third-party document parsers via a Python entry point
-have. However, some parts of the application have been designed to allow
+plugin system. Plugins are distributed as ordinary Python packages and
-easy integration of additional features without any modification to the
+discovered automatically at startup — no changes to the Paperless-ngx source
-base code.
+are required.
 !!! warning "Third-party plugins are not officially supported"
    The Paperless-ngx maintainers do not provide support for third-party
    plugins. Issues that are caused by or require changes to a third-party
    plugin will be closed without further investigation. If you believe you
    have found a bug in Paperless-ngx itself (not in a plugin), please
    reproduce it with all third-party plugins removed before filing an issue.
 ### Making custom parsers
-Paperless-ngx uses parsers to add documents. A parser is
+Paperless-ngx uses parsers to add documents. A parser is responsible for:
 responsible for:
- Retrieving the content from the original
+- Extracting plain-text content from the document
- Creating a thumbnail
+- Generating a thumbnail image
- _optional:_ Retrieving a created date from the original
+- _optional:_ Detecting the document's creation date
- _optional:_ Creating an archived document from the original
+- _optional:_ Producing a searchable PDF archive copy
-Custom parsers can be added to Paperless-ngx to support more file types. In
+Custom parsers are distributed as ordinary Python packages and registered
-order to do that, you need to write the parser itself and announce its
+via a [setuptools entry point](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
-existence to Paperless-ngx.
+No changes to the Paperless-ngx source are required.
-The parser itself must extend `documents.parsers.DocumentParser` and
+#### 1. Implementing the parser class
-must implement the methods `parse` and `get_thumbnail`. You can provide
+
-your own implementation to `get_date` if you don't want to rely on
+Your parser must satisfy the `ParserProtocol` structural interface defined in
-Paperless-ngx' default date guessing mechanisms.
+`paperless.parsers`. The simplest approach is to write a plain class — no base
 class is required, only the right attributes and methods.
 **Class-level identity attributes**
 The registry reads these before instantiating the parser, so they must be
 plain class attributes (not instance attributes or properties):
 ```python
-class MyCustomParser(DocumentParser):
+class MyCustomParser:
-
+    name    = "My Format Parser"   # human-readable name shown in logs
-    def parse(self, document_path, mime_type):
+    version = "1.0.0"              # semantic version string
-        # This method does not return anything. Rather, you should assign
+    author  = "Acme Corp"          # author / organisation
-        # whatever you got from the document to the following fields:
+    url     = "https://example.com/my-parser"  # docs or issue tracker
        # The content of the document.
        self.text = "content"
        # Optional: path to a PDF document that you created from the original.
        self.archive_path = os.path.join(self.tempdir, "archived.pdf")
        # Optional: "created" date of the document.
        self.date = get_created_from_metadata(document_path)
    def get_thumbnail(self, document_path, mime_type):
        # This should return the path to a thumbnail you created for this
        # document.
        return os.path.join(self.tempdir, "thumb.webp")
 ```
-If you encounter any issues during parsing, raise a
+**Declaring supported MIME types**
 `documents.parsers.ParseError`.
-The `self.tempdir` directory is a temporary directory that is guaranteed
+Return a `dict` mapping MIME type strings to preferred file extensions
-to be empty and removed after consumption finished. You can use that
+(including the leading dot). Paperless-ngx uses the extension when storing
-directory to store any intermediate files and also use it to store the
+archive copies and serving files for download.
 thumbnail / archived document.
 After that, you need to announce your parser to Paperless-ngx. You need to
 connect a handler to the `document_consumer_declaration` signal. Have a
 look in the file `src/paperless_tesseract/apps.py` on how that's done.
 The handler is a method that returns information about your parser:
 ```python
-def myparser_consumer_declaration(sender, **kwargs):
+@classmethod
 def supported_mime_types(cls) -> dict[str, str]:
    return {
-        "parser": MyCustomParser,
+        "application/x-my-format": ".myf",
-        "weight": 0,
+        "application/x-my-format-alt": ".myf",
        "mime_types": {
            "application/pdf": ".pdf",
            "image/jpeg": ".jpg",
        }
    }
 ```
- `parser` is a reference to a class that extends `DocumentParser`.
+**Scoring**
 - `weight` is used whenever two or more parsers are able to parse a
  file: The parser with the higher weight wins. This can be used to
  override the parsers provided by Paperless-ngx.
 - `mime_types` is a dictionary. The keys are the mime types your
  parser supports and the value is the default file extension that
  Paperless-ngx should use when storing files and serving them for
  download. We could guess that from the file extensions, but some
  mime types have many extensions associated with them and the Python
  methods responsible for guessing the extension do not always return
  the same value.
-## Using Visual Studio Code devcontainer
+When more than one parser can handle a file, the registry calls `score()` on
 each candidate and picks the one with the highest result. Return `None` to
 decline handling a file even though the MIME type is listed as supported (for
 example, when a required external service is not configured).
-Another easy way to get started with development is to use Visual Studio
+| Score  | Meaning                                           |
-Code devcontainers. This approach will create a preconfigured development
+| ------ | ------------------------------------------------- |
-environment with all of the required tools and dependencies.
+| `None` | Decline — do not handle this file                 |
-[Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
+| `10`   | Default priority used by all built-in parsers     |
-The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
+| `> 10` | Override a built-in parser for the same MIME type |
 contain more information about the specific tasks and launch configurations (see the
 non-standard "description" field).
-To get started:
+```python
@classmethod
 def score(
    cls,
    mime_type: str,
    filename: str,
    path: "Path | None" = None,
 ) -> int | None:
    # Inspect filename or file bytes here if needed.
    return 10
 ```
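
The selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the actual registry code: `pick_parser` and both stand-in classes are hypothetical, and the tie-breaking behaviour (first candidate wins) is an assumption the guide does not specify.

```python
from typing import Optional


class BuiltinLikeParser:
    """Stand-in for a built-in parser; uses the default priority of 10."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path=None) -> Optional[int]:
        return 10


class OverrideParser:
    """Stand-in for a plugin that outranks the default for .myf files only."""

    @classmethod
    def score(cls, mime_type: str, filename: str, path=None) -> Optional[int]:
        return 20 if filename.endswith(".myf") else None


def pick_parser(candidates, mime_type: str, filename: str):
    """Hypothetical selection: drop None scores, return the highest scorer."""
    scored = [(c.score(mime_type, filename), c) for c in candidates]
    scored = [(s, c) for s, c in scored if s is not None]
    if not scored:
        return None  # every candidate declined
    return max(scored, key=lambda pair: pair[0])[1]
```

Returning `None` for files the plugin cannot handle lets a lower-scoring parser take over instead of failing the consumption.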
-1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
+**Archive and rendition flags**
-2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
+```python
@property
 def can_produce_archive(self) -> bool:
    """True if parse() can produce a searchable PDF archive copy."""
    return True   # or False if your parser doesn't produce PDFs
-3. In case your host operating system is Windows:
+@property
-   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
+def requires_pdf_rendition(self) -> bool:
-   - Git might have detecteded modifications for all files, because Windows is using CRLF line endings. Run `git checkout .` in the containers terminal to fix this issue.
+    """True if the original format cannot be displayed by a browser
    (e.g. DOCX, ODT) and the PDF output must always be kept."""
    return False
 ```
-4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
+**Context manager — temp directory lifecycle**
   will initialize the database tables and create a superuser. Then you can compile the front end
   for production or run the frontend in debug mode.
-5. The project is ready for debugging, start either run the fullstack debug or individual debug
+Paperless-ngx always uses parsers as context managers. Create a temporary
-   processes. Yo spin up the project without debugging run the task **Project Start: Run all Services**
+working directory in `__enter__` (or `__init__`) and remove it in `__exit__`
 regardless of whether an exception occurred. Store intermediate files,
 thumbnails, and archive PDFs inside this directory.
-## Developing Date Parser Plugins
+```python
 import shutil
 import tempfile
 from pathlib import Path
 from typing import Self
 from types import TracebackType
 from django.conf import settings
 class MyCustomParser:
    ...
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(
            tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR)
        )
        self._text: str | None = None
        self._archive_path: Path | None = None
    def __enter__(self) -> Self:
        return self
    def __exit__(
        self,
        exc_type: type[BaseException] | None,
        exc_val: BaseException | None,
        exc_tb: TracebackType | None,
    ) -> None:
        shutil.rmtree(self._tempdir, ignore_errors=True)
 ```
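
To see why cleanup belongs in `__exit__`, the following self-contained sketch simulates a failure in the middle of parsing. `ScratchDirDemo` and `run_and_fail` are hypothetical names, and the system temp directory stands in for `settings.SCRATCH_DIR` so the example runs without Django.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


class ScratchDirDemo:
    """Illustrative stand-in for a parser; only temp-dir handling is shown."""

    def __init__(self) -> None:
        # The guide uses settings.SCRATCH_DIR; the system temp dir is used
        # here so the sketch runs without Django.
        self.tempdir = Path(tempfile.mkdtemp(prefix="paperless-demo-"))

    def __enter__(self) -> "ScratchDirDemo":
        return self

    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        # Returning None (falsy) lets any exception propagate after cleanup.
        shutil.rmtree(self.tempdir, ignore_errors=True)


def run_and_fail() -> Path:
    """Use the demo as a context manager, fail midway, return its temp dir."""
    seen: Optional[Path] = None
    try:
        with ScratchDirDemo() as demo:
            seen = demo.tempdir
            (seen / "intermediate.txt").write_text("scratch data")
            raise RuntimeError("simulated parse failure")
    except RuntimeError:
        pass
    assert seen is not None
    return seen
```

After `run_and_fail()` returns, the directory no longer exists even though the body raised.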
 **Optional context — `configure()`**
 The consumer calls `configure()` with a `ParserContext` after instantiation
 and before `parse()`. If your parser doesn't need context, a no-op
 implementation is fine:
 ```python
 from paperless.parsers import ParserContext
 def configure(self, context: ParserContext) -> None:
    pass   # override if you need context.mailrule_id, etc.
 ```
 **Parsing**
 `parse()` is the core method. It must not return a value; instead, store
 results in instance attributes and expose them via the accessor methods below.
 Raise `documents.parsers.ParseError` on any unrecoverable failure.
 ```python
 from documents.parsers import ParseError
 def parse(
    self,
    document_path: Path,
    mime_type: str,
    *,
    produce_archive: bool = True,
 ) -> None:
    try:
        self._text = extract_text_from_my_format(document_path)
    except Exception as e:
        raise ParseError(f"Failed to parse {document_path}: {e}") from e
    if produce_archive and self.can_produce_archive:
        archive = self._tempdir / "archived.pdf"
        convert_to_pdf(document_path, archive)
        self._archive_path = archive
 ```
 **Result accessors**
 ```python
 def get_text(self) -> str | None:
    return self._text
 def get_date(self) -> "datetime.datetime | None":
    # Return a datetime extracted from the document, or None to let
    # Paperless-ngx use its default date-guessing logic.
    return None
 def get_archive_path(self) -> Path | None:
    return self._archive_path
 ```
 **Thumbnail**
 `get_thumbnail()` may be called independently of `parse()`. Return the path
 to a WebP image inside `self._tempdir`. The image should be roughly 500 × 700
 pixels.
 ```python
 def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
    thumb = self._tempdir / "thumb.webp"
    render_thumbnail(document_path, thumb)
    return thumb
 ```
 **Optional methods**
 These are called by the API on demand, not during the consumption pipeline.
 Implement them if your format supports the information; otherwise return
 `None` / `[]`.
 ```python
 def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
    return count_pages(document_path)
 def extract_metadata(
    self,
    document_path: Path,
    mime_type: str,
 ) -> "list[MetadataEntry]":
    # Must never raise. Return [] if metadata cannot be read.
    from paperless.parsers import MetadataEntry
    return [
        MetadataEntry(
            namespace="https://example.com/ns/",
            prefix="ex",
            key="Author",
            value="Alice",
        )
    ]
 ```
 #### 2. Registering via entry point
 Add the following to your package's `pyproject.toml`. The key (left of `=`)
 is an arbitrary name used only in log output; the value is the
 `module:ClassName` import path.
 ```toml
 [project.entry-points."paperless_ngx.parsers"]
 my_parser = "my_package.parsers:MyCustomParser"
 ```
 Install your package into the same Python environment as Paperless-ngx (or
 add it to the Docker image), and the parser will be discovered automatically
 on the next startup. No configuration changes are needed.
 To verify discovery, check the application logs at startup for a line like:
 ```
 Loaded third-party parser 'My Format Parser' v1.0.0 by Acme Corp (entrypoint: 'my_parser').
 ```
 #### 3. Utilities
 `paperless.parsers.utils` provides helpers you can import directly:
 | Function                                | Description                                                      |
 | --------------------------------------- | ---------------------------------------------------------------- |
 | `read_file_handle_unicode_errors(path)` | Read a file as UTF-8, replacing invalid bytes instead of raising |
 | `get_page_count_for_pdf(path)`          | Count pages in a PDF using pikepdf                               |
 | `extract_pdf_metadata(path)`            | Extract XMP metadata from a PDF as a `list[MetadataEntry]`       |
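
For readers unfamiliar with the "replacing invalid bytes" behaviour in the first row, here is a standard-library sketch of the same idea. It is not the implementation of `read_file_handle_unicode_errors`; the function name `read_text_lenient` is hypothetical.

```python
from pathlib import Path


def read_text_lenient(path: Path) -> str:
    """Decode a file as UTF-8, substituting U+FFFD for undecodable bytes."""
    return path.read_bytes().decode("utf-8", errors="replace")
```

A file containing a stray Latin-1 byte is read with a replacement character in its place rather than raising `UnicodeDecodeError`.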
 #### Minimal example
 A complete, working parser for a hypothetical plain-XML format:
 ```python
 from __future__ import annotations
 import shutil
 import tempfile
 from pathlib import Path
 from typing import Self
 from types import TracebackType
 import xml.etree.ElementTree as ET
 from django.conf import settings
 from documents.parsers import ParseError
 from paperless.parsers import ParserContext
 class XmlDocumentParser:
    name    = "XML Parser"
    version = "1.0.0"
    author  = "Acme Corp"
    url     = "https://example.com/xml-parser"
    @classmethod
    def supported_mime_types(cls) -> dict[str, str]:
        return {"application/xml": ".xml", "text/xml": ".xml"}
    @classmethod
    def score(cls, mime_type: str, filename: str, path: Path | None = None) -> int | None:
        return 10
    @property
    def can_produce_archive(self) -> bool:
        return False
    @property
    def requires_pdf_rendition(self) -> bool:
        return False
    def __init__(self, logging_group: object = None) -> None:
        settings.SCRATCH_DIR.mkdir(parents=True, exist_ok=True)
        self._tempdir = Path(tempfile.mkdtemp(prefix="paperless-", dir=settings.SCRATCH_DIR))
        self._text: str | None = None
    def __enter__(self) -> Self:
        return self
    def __exit__(self, exc_type, exc_val, exc_tb) -> None:
        shutil.rmtree(self._tempdir, ignore_errors=True)
    def configure(self, context: ParserContext) -> None:
        pass
    def parse(self, document_path: Path, mime_type: str, *, produce_archive: bool = True) -> None:
        try:
            tree = ET.parse(document_path)
            self._text = " ".join(tree.getroot().itertext())
        except ET.ParseError as e:
            raise ParseError(f"XML parse error: {e}") from e
    def get_text(self) -> str | None:
        return self._text
    def get_date(self):
        return None
    def get_archive_path(self) -> Path | None:
        return None
    def get_thumbnail(self, document_path: Path, mime_type: str) -> Path:
        from PIL import Image, ImageDraw
        img = Image.new("RGB", (500, 700), color="white")
        ImageDraw.Draw(img).text((10, 10), "XML Document", fill="black")
        out = self._tempdir / "thumb.webp"
        img.save(out, format="WEBP")
        return out
    def get_page_count(self, document_path: Path, mime_type: str) -> int | None:
        return None
    def extract_metadata(self, document_path: Path, mime_type: str) -> list:
        return []
 ```
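
Before packaging a parser, a quick sanity check can catch a forgotten attribute or method. The sketch below is an assumption-laden helper, not a Paperless-ngx API: `missing_members` is hypothetical, and `EXPECTED_MEMBERS` is simply the set of names used in this guide, which may not match the real `ParserProtocol` exactly.

```python
# Names collected from this guide; the real ParserProtocol may differ.
EXPECTED_MEMBERS = (
    "name", "version", "author", "url",
    "supported_mime_types", "score",
    "can_produce_archive", "requires_pdf_rendition",
    "__enter__", "__exit__",
    "configure", "parse",
    "get_text", "get_date", "get_archive_path", "get_thumbnail",
    "get_page_count", "extract_metadata",
)


def missing_members(parser_cls: type) -> list[str]:
    """Return the expected names that parser_cls does not define."""
    return [m for m in EXPECTED_MEMBERS if not hasattr(parser_cls, m)]


class IncompleteParser:
    """Deliberately incomplete stand-in used to show the check."""

    name = "Incomplete"
    version = "0.0.1"

    def parse(self, document_path, mime_type, *, produce_archive=True):
        return None


if __name__ == "__main__":
    print(missing_members(IncompleteParser))
```

The check only looks at names, not signatures or return types, so an empty result is necessary but not sufficient for a working parser.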
 ### Developing date parser plugins
 Paperless-ngx uses a plugin system for date parsing, allowing you to extend or replace the default date parsing behavior. Plugins are discovered using [Python entry points](https://setuptools.pypa.io/en/latest/userguide/entry_point.html).
-### Creating a Date Parser Plugin
+#### Creating a Date Parser Plugin
 To create a custom date parser plugin, you need to:
@@ -492,7 +734,7 @@ To create a custom date parser plugin, you need to:
 2. Implement the required abstract method
 3. Register your plugin via an entry point
-#### 1. Implementing the Parser Class
+##### 1. Implementing the Parser Class
 Your parser must extend `documents.plugins.date_parsing.DateParserPluginBase` and implement the `parse` method:
@@ -532,7 +774,7 @@ class MyDateParserPlugin(DateParserPluginBase):
        yield another_datetime
 ```
-#### 2. Configuration and Helper Methods
+##### 2. Configuration and Helper Methods
 Your parser instance is initialized with a `DateParserConfig` object accessible via `self.config`. This provides:
@@ -565,11 +807,11 @@ def _filter_date(
    """
 ```
-#### 3. Resource Management (Optional)
+##### 3. Resource Management (Optional)
 If your plugin needs to acquire or release resources (database connections, API clients, etc.), override the context manager methods. Paperless-ngx will always use plugins as context managers, ensuring resources can be released even in the event of errors.
-#### 4. Registering Your Plugin
+##### 4. Registering Your Plugin
 Register your plugin using a setuptools entry point in your package's `pyproject.toml`:
@@ -580,7 +822,7 @@ my_parser = "my_package.parsers:MyDateParserPlugin"
 The entry point name (e.g., `"my_parser"`) is used for sorting when multiple plugins are found. Paperless-ngx will use the first plugin alphabetically by name if multiple plugins are discovered.
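
The alphabetical rule can be illustrated with a short sketch. `select_plugin_name` is a hypothetical helper, not Paperless-ngx code, and it assumes plain Python string ordering (case-sensitive, by code point); the exact comparison Paperless-ngx uses is not stated here.

```python
from typing import Optional


def select_plugin_name(discovered: list[str]) -> Optional[str]:
    """Return the entry point name that sorts first, or None if none exist."""
    if not discovered:
        return None
    return sorted(discovered)[0]
```

Under this assumption, choosing an entry point name that sorts early is one way to make a plugin take precedence when several are installed.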
-### Plugin Discovery
+#### Plugin Discovery
 Paperless-ngx automatically discovers and loads date parser plugins at runtime. The discovery process:
@@ -591,7 +833,7 @@ Paperless-ngx automatically discovers and loads date parser plugins at runtime.
 If multiple plugins are installed, a warning is logged indicating which plugin was selected.
-### Example: Simple Date Parser
+#### Example: Simple Date Parser
 Here's a minimal example that only looks for ISO 8601 dates:
@@ -623,3 +865,30 @@ class ISODateParserPlugin(DateParserPluginBase):
            if filtered_date is not None:
                yield filtered_date
 ```
 ## Using Visual Studio Code devcontainer
 Another easy way to get started with development is to use Visual Studio
 Code devcontainers. This approach will create a preconfigured development
 environment with all of the required tools and dependencies.
 [Learn more about devcontainers](https://code.visualstudio.com/docs/devcontainers/containers).
 The .devcontainer/vscode/tasks.json and .devcontainer/vscode/launch.json files
 contain more information about the specific tasks and launch configurations (see the
 non-standard "description" field).
 To get started:
 1. Clone the repository on your machine and open the Paperless-ngx folder in VS Code.
 2. VS Code will prompt you with "Reopen in container". Do so and wait for the environment to start.
 3. In case your host operating system is Windows:
   - The Source Control view in Visual Studio Code might show: "The detected Git repository is potentially unsafe as the folder is owned by someone other than the current user." Use "Manage Unsafe Repositories" to fix this.
   - Git might have detected modifications for all files because Windows uses CRLF line endings. Run `git checkout .` in the container's terminal to fix this issue.
 4. Initialize the project by running the task **Project Setup: Run all Init Tasks**. This
   will initialize the database tables and create a superuser. Then you can compile the front end
   for production or run the frontend in debug mode.
 5. The project is ready for debugging: start either the fullstack debug or the individual debug
   processes. To spin up the project without debugging, run the task **Project Start: Run all Services**.
Author	SHA1	Message	Date
Trenton H	f1fecfc2aa	Moves the date parsing plugin section under the extending section	2026-03-20 15:05:13 -07:00
Trenton H	dd01f5b263	Adds a section about how the 2 install types can add external plugins	2026-03-20 14:55:17 -07:00
Trenton H	4fd6963d27	Inital documentation updates for developing a plugin	2026-03-20 14:48:45 -07:00