Storing more ideas/plans

This commit is contained in:
Trenton Holmes
2026-06-12 11:09:26 -07:00
committed by stumpylog
parent 6a610e5f87
commit da02f3ef2d
7 changed files with 3425 additions and 0 deletions
@@ -0,0 +1,695 @@
# AI Taxonomy Candidate Injection Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Inject the user's existing taxonomy (tags, correspondents, document types, storage paths) as candidates into the LLM prompt so it prefers exact existing names over inventing new ones.
**Architecture:** A new `get_taxonomy_candidates(user)` helper fetches each category permission-filtered to the requesting user, annotated with document-count for frequency ordering, and capped at 200 per category. A private `_format_candidates_section` helper renders the candidate lists into a prompt appendix. `build_prompt_without_rag` and `build_prompt_with_rag` each gain an optional `candidates` parameter. `get_ai_document_classification` wires it all together — fetch candidates then pass them to the prompt builder. No changes to the view, matching layer, or response format.
**Tech Stack:** Django ORM (`annotate`, `Count`), `get_objects_for_user_owner_aware` (already used in `matching.py`), pytest + `unittest.mock`
---
## File Map
- **Modify:** `src/paperless_ai/ai_classifier.py`
- Add constant `TAXONOMY_CANDIDATE_LIMIT = 200`
- Add `get_taxonomy_candidates(user)` helper
- Add `_format_candidates_section(candidates)` helper
- Update `build_prompt_without_rag` signature and body
- Update `build_prompt_with_rag` signature and body
- Update `get_ai_document_classification` body
- **Create:** `src/paperless_ai/tests/test_taxonomy_candidates.py`
- All new tests for the above
---
### Task 1: `get_taxonomy_candidates` — tests + implementation
**Files:**
- Modify: `src/paperless_ai/ai_classifier.py`
- Create: `src/paperless_ai/tests/test_taxonomy_candidates.py`
- [ ] **Step 1: Write the failing tests**
Create `src/paperless_ai/tests/test_taxonomy_candidates.py`:
```python
import pytest
from unittest.mock import patch
from django.contrib.auth.models import User
from documents.models import Correspondent
from documents.models import Document
from documents.models import DocumentType
from documents.models import StoragePath
from documents.models import Tag
from paperless_ai.ai_classifier import TAXONOMY_CANDIDATE_LIMIT
from paperless_ai.ai_classifier import get_taxonomy_candidates
def test_get_taxonomy_candidates_returns_none_for_none_user():
assert get_taxonomy_candidates(None) is None
@pytest.mark.django_db
class TestGetTaxonomyCandidates:
def test_returns_dict_with_four_keys(self):
user = User.objects.create_user(username="tc_user1", password="x")
with patch(
"paperless_ai.ai_classifier.get_objects_for_user_owner_aware",
) as mock_get:
mock_get.side_effect = [
Tag.objects.none(),
Correspondent.objects.none(),
DocumentType.objects.none(),
StoragePath.objects.none(),
]
result = get_taxonomy_candidates(user)
assert result is not None
assert set(result.keys()) == {
"tags",
"correspondents",
"document_types",
"storage_paths",
}
def test_returns_names_as_strings(self):
user = User.objects.create_user(username="tc_user2", password="x")
tag = Tag.objects.create(name="Bloodwork")
with patch(
"paperless_ai.ai_classifier.get_objects_for_user_owner_aware",
) as mock_get:
mock_get.side_effect = [
Tag.objects.filter(pk=tag.pk),
Correspondent.objects.none(),
DocumentType.objects.none(),
StoragePath.objects.none(),
]
result = get_taxonomy_candidates(user)
assert result["tags"] == ["Bloodwork"]
def test_orders_tags_by_document_count_descending(self):
user = User.objects.create_user(username="tc_user3", password="x")
tag_low = Tag.objects.create(name="LowUse")
tag_high = Tag.objects.create(name="HighUse")
doc1 = Document.objects.create(mime_type="text/plain", checksum="tc_doc1")
doc2 = Document.objects.create(mime_type="text/plain", checksum="tc_doc2")
doc3 = Document.objects.create(mime_type="text/plain", checksum="tc_doc3")
doc1.tags.add(tag_high)
doc2.tags.add(tag_high)
doc3.tags.add(tag_low)
with patch(
"paperless_ai.ai_classifier.get_objects_for_user_owner_aware",
) as mock_get:
mock_get.side_effect = [
Tag.objects.filter(pk__in=[tag_low.pk, tag_high.pk]),
Correspondent.objects.none(),
DocumentType.objects.none(),
StoragePath.objects.none(),
]
result = get_taxonomy_candidates(user)
assert result["tags"] == ["HighUse", "LowUse"]
def test_caps_results_at_taxonomy_candidate_limit(self):
user = User.objects.create_user(username="tc_user4", password="x")
tags = [Tag.objects.create(name=f"Tag{i}") for i in range(5)]
with (
patch(
"paperless_ai.ai_classifier.get_objects_for_user_owner_aware",
) as mock_get,
patch("paperless_ai.ai_classifier.TAXONOMY_CANDIDATE_LIMIT", 3),
):
mock_get.side_effect = [
Tag.objects.filter(pk__in=[t.pk for t in tags]),
Correspondent.objects.none(),
DocumentType.objects.none(),
StoragePath.objects.none(),
]
result = get_taxonomy_candidates(user)
assert len(result["tags"]) == 3
def test_all_four_categories_are_fetched(self):
user = User.objects.create_user(username="tc_user5", password="x")
tag = Tag.objects.create(name="MyTag")
corr = Correspondent.objects.create(name="MyCorr")
dt = DocumentType.objects.create(name="MyType")
sp = StoragePath.objects.create(name="MyPath", path="/my/path")
with patch(
"paperless_ai.ai_classifier.get_objects_for_user_owner_aware",
) as mock_get:
mock_get.side_effect = [
Tag.objects.filter(pk=tag.pk),
Correspondent.objects.filter(pk=corr.pk),
DocumentType.objects.filter(pk=dt.pk),
StoragePath.objects.filter(pk=sp.pk),
]
result = get_taxonomy_candidates(user)
assert result["tags"] == ["MyTag"]
assert result["correspondents"] == ["MyCorr"]
assert result["document_types"] == ["MyType"]
assert result["storage_paths"] == ["MyPath"]
```
- [ ] **Step 2: Run to confirm they all fail**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: `ImportError` or `FAILED``get_taxonomy_candidates` does not exist yet.
- [ ] **Step 3: Add the implementation to `ai_classifier.py`**
At the top of `src/paperless_ai/ai_classifier.py`, add new imports after the existing ones:
```python
from django.db.models import Count
from documents.models import Correspondent
from documents.models import DocumentType
from documents.models import StoragePath
from documents.models import Tag
from documents.permissions import get_objects_for_user_owner_aware
```
Add the constant and helper right after the `logger` line:
```python
TAXONOMY_CANDIDATE_LIMIT = 200
def get_taxonomy_candidates(user: User | None) -> dict[str, list[str]] | None:
if user is None:
return None
tags = (
get_objects_for_user_owner_aware(user, ["view_tag"], Tag)
.annotate(doc_count=Count("documents"))
.order_by("-doc_count")[:TAXONOMY_CANDIDATE_LIMIT]
)
correspondents = (
get_objects_for_user_owner_aware(user, ["view_correspondent"], Correspondent)
.annotate(doc_count=Count("documents"))
.order_by("-doc_count")[:TAXONOMY_CANDIDATE_LIMIT]
)
document_types = (
get_objects_for_user_owner_aware(user, ["view_documenttype"], DocumentType)
.annotate(doc_count=Count("documents"))
.order_by("-doc_count")[:TAXONOMY_CANDIDATE_LIMIT]
)
storage_paths = (
get_objects_for_user_owner_aware(user, ["view_storagepath"], StoragePath)
.annotate(doc_count=Count("documents"))
.order_by("-doc_count")[:TAXONOMY_CANDIDATE_LIMIT]
)
return {
"tags": [t.name for t in tags],
"correspondents": [c.name for c in correspondents],
"document_types": [d.name for d in document_types],
"storage_paths": [s.name for s in storage_paths],
}
```
- [ ] **Step 4: Run to confirm Task 1 tests pass**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: all 6 tests `PASSED`.
- [ ] **Step 5: Confirm existing AI classifier tests still pass**
```bash
cd src && uv run pytest paperless_ai/tests/test_ai_classifier.py --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: all tests `PASSED`.
- [ ] **Step 6: Commit**
```bash
git add src/paperless_ai/ai_classifier.py src/paperless_ai/tests/test_taxonomy_candidates.py
git commit -m "feat: add get_taxonomy_candidates helper with frequency ordering and cap"
```
---
### Task 2: Prompt injection — `_format_candidates_section` + `build_prompt_without_rag`
**Files:**
- Modify: `src/paperless_ai/ai_classifier.py`
- Modify: `src/paperless_ai/tests/test_taxonomy_candidates.py`
- [ ] **Step 1: Write the failing tests**
Append to `src/paperless_ai/tests/test_taxonomy_candidates.py`:
```python
from unittest.mock import MagicMock
from paperless_ai.ai_classifier import build_prompt_without_rag
@pytest.fixture
def mock_doc():
doc = MagicMock(spec=Document)
doc.filename = "invoice.pdf"
doc.content = "Some document content."
return doc
class TestBuildPromptWithoutRag:
def test_no_candidates_section_when_candidates_is_none(self, mock_doc):
prompt = build_prompt_without_rag(mock_doc, candidates=None)
assert "Existing metadata" not in prompt
def test_no_candidates_section_when_candidates_is_empty_dict(self, mock_doc):
prompt = build_prompt_without_rag(mock_doc, candidates={})
assert "Existing metadata" not in prompt
def test_candidates_section_present_when_provided(self, mock_doc):
candidates = {
"tags": ["Bloodwork", "Insurance"],
"correspondents": ["Dr. Smith"],
"document_types": [],
"storage_paths": [],
}
prompt = build_prompt_without_rag(mock_doc, candidates=candidates)
assert "Existing metadata" in prompt
assert "Bloodwork" in prompt
assert "Dr. Smith" in prompt
def test_empty_categories_omitted_from_section(self, mock_doc):
candidates = {
"tags": ["Bloodwork"],
"correspondents": [],
"document_types": [],
"storage_paths": [],
}
prompt = build_prompt_without_rag(mock_doc, candidates=candidates)
assert "Correspondents:" not in prompt
assert "Document types:" not in prompt
assert "Storage paths:" not in prompt
def test_existing_prompt_content_preserved(self, mock_doc):
prompt = build_prompt_without_rag(mock_doc, candidates=None)
assert "invoice.pdf" in prompt
assert "Some document content." in prompt
```
- [ ] **Step 2: Run to confirm they fail**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestBuildPromptWithoutRag --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: `FAILED``build_prompt_without_rag` doesn't accept `candidates` yet.
- [ ] **Step 3: Add `_format_candidates_section` and update `build_prompt_without_rag` in `ai_classifier.py`**
Add `_format_candidates_section` immediately after `get_taxonomy_candidates`:
```python
def _format_candidates_section(candidates: dict[str, list[str]]) -> str:
lines = [
"Existing metadata (use exact names where they fit; suggest new ones only if nothing matches):",
]
for key, label in [
("tags", "Tags"),
("correspondents", "Correspondents"),
("document_types", "Document types"),
("storage_paths", "Storage paths"),
]:
names = candidates.get(key, [])
if names:
lines.append(f"{label}: {', '.join(names)}")
return "\n".join(lines)
```
Replace the existing `build_prompt_without_rag`:
```python
def build_prompt_without_rag(
document: Document,
candidates: dict[str, list[str]] | None = None,
) -> str:
filename = document.filename or ""
content = truncate_content(document.content[:4000] or "")
prompt = f"""
You are a document classification assistant.
Analyze the following document and extract the following information:
- A short descriptive title
- Tags that reflect the content
- Names of people or organizations mentioned
- The type or category of the document
- Suggested folder paths for storing the document
- Up to 3 relevant dates in YYYY-MM-DD format
Filename:
{filename}
Content:
{content}
""".strip()
if candidates:
prompt += "\n\n" + _format_candidates_section(candidates)
return prompt
```
- [ ] **Step 4: Run to confirm Task 2 tests pass**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestBuildPromptWithoutRag --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: all 5 tests `PASSED`.
- [ ] **Step 5: Run full test file to check no regressions**
```bash
cd src && uv run pytest paperless_ai/tests/ --override-ini="addopts=" -v 2>&1 | tail -30
```
Expected: all tests `PASSED`.
- [ ] **Step 6: Commit**
```bash
git add src/paperless_ai/ai_classifier.py src/paperless_ai/tests/test_taxonomy_candidates.py
git commit -m "feat: inject taxonomy candidates into build_prompt_without_rag"
```
---
### Task 3: Update `build_prompt_with_rag`
**Files:**
- Modify: `src/paperless_ai/ai_classifier.py`
- Modify: `src/paperless_ai/tests/test_taxonomy_candidates.py`
- [ ] **Step 1: Write the failing tests**
Append to `src/paperless_ai/tests/test_taxonomy_candidates.py`:
```python
from paperless_ai.ai_classifier import build_prompt_with_rag
class TestBuildPromptWithRag:
def test_no_candidates_section_when_candidates_is_none(self, mock_doc):
with patch(
"paperless_ai.ai_classifier.get_context_for_document",
return_value="similar doc context",
):
prompt = build_prompt_with_rag(mock_doc, candidates=None)
assert "Existing metadata" not in prompt
def test_candidates_section_present_when_provided(self, mock_doc):
candidates = {
"tags": ["Insurance"],
"correspondents": [],
"document_types": ["Invoice"],
"storage_paths": [],
}
with patch(
"paperless_ai.ai_classifier.get_context_for_document",
return_value="similar doc context",
):
prompt = build_prompt_with_rag(mock_doc, candidates=candidates)
assert "Existing metadata" in prompt
assert "Insurance" in prompt
assert "Invoice" in prompt
def test_rag_context_still_present(self, mock_doc):
with patch(
"paperless_ai.ai_classifier.get_context_for_document",
return_value="similar doc context",
):
prompt = build_prompt_with_rag(mock_doc, candidates=None)
assert "similar doc context" in prompt
```
- [ ] **Step 2: Run to confirm they fail**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestBuildPromptWithRag --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: `FAILED``build_prompt_with_rag` doesn't accept `candidates` yet.
- [ ] **Step 3: Update `build_prompt_with_rag` in `ai_classifier.py`**
Replace the existing `build_prompt_with_rag`:
```python
def build_prompt_with_rag(
document: Document,
user: User | None = None,
candidates: dict[str, list[str]] | None = None,
) -> str:
base_prompt = build_prompt_without_rag(document, candidates)
context = truncate_content(get_context_for_document(document, user))
return f"""{base_prompt}
Additional context from similar documents:
{context}
""".strip()
```
- [ ] **Step 4: Run to confirm Task 3 tests pass**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestBuildPromptWithRag --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: all 3 tests `PASSED`.
- [ ] **Step 5: Run full test file to check no regressions**
```bash
cd src && uv run pytest paperless_ai/tests/ --override-ini="addopts=" -v 2>&1 | tail -30
```
Expected: all tests `PASSED`.
- [ ] **Step 6: Commit**
```bash
git add src/paperless_ai/ai_classifier.py src/paperless_ai/tests/test_taxonomy_candidates.py
git commit -m "feat: pass taxonomy candidates through build_prompt_with_rag"
```
---
### Task 4: Wire `get_ai_document_classification`
**Files:**
- Modify: `src/paperless_ai/ai_classifier.py`
- Modify: `src/paperless_ai/tests/test_taxonomy_candidates.py`
- [ ] **Step 1: Write the failing tests**
Append to `src/paperless_ai/tests/test_taxonomy_candidates.py`:
```python
from django.test import override_settings
from paperless_ai.ai_classifier import get_ai_document_classification
@pytest.mark.django_db
class TestGetAiDocumentClassificationCandidateWiring:
@override_settings(LLM_BACKEND="ollama", LLM_MODEL="some_model")
def test_candidates_fetched_and_passed_when_user_provided(self, mock_doc):
user = User.objects.create_user(username="tc_wire_user1", password="x")
fake_candidates = {
"tags": ["Bloodwork"],
"correspondents": [],
"document_types": [],
"storage_paths": [],
}
with (
patch(
"paperless_ai.ai_classifier.get_taxonomy_candidates",
return_value=fake_candidates,
) as mock_candidates,
patch(
"paperless_ai.ai_classifier.build_prompt_without_rag",
return_value="prompt",
) as mock_build,
patch("paperless_ai.client.AIClient.run_llm_query") as mock_llm,
):
mock_llm.return_value = {
"title": "",
"tags": [],
"correspondents": [],
"document_types": [],
"storage_paths": [],
"dates": [],
}
get_ai_document_classification(mock_doc, user)
mock_candidates.assert_called_once_with(user)
mock_build.assert_called_once_with(mock_doc, fake_candidates)
@override_settings(LLM_BACKEND="ollama", LLM_MODEL="some_model")
def test_no_candidates_when_user_is_none(self, mock_doc):
with (
patch(
"paperless_ai.ai_classifier.get_taxonomy_candidates",
) as mock_candidates,
patch(
"paperless_ai.ai_classifier.build_prompt_without_rag",
return_value="prompt",
) as mock_build,
patch("paperless_ai.client.AIClient.run_llm_query") as mock_llm,
):
mock_llm.return_value = {
"title": "",
"tags": [],
"correspondents": [],
"document_types": [],
"storage_paths": [],
"dates": [],
}
get_ai_document_classification(mock_doc, user=None)
mock_candidates.assert_not_called()
mock_build.assert_called_once_with(mock_doc, None)
@override_settings(
LLM_BACKEND="ollama",
LLM_MODEL="some_model",
LLM_EMBEDDING_BACKEND="huggingface",
LLM_EMBEDDING_MODEL="some_model",
)
def test_candidates_passed_to_rag_prompt_when_embedding_configured(self, mock_doc):
user = User.objects.create_user(username="tc_wire_user2", password="x")
fake_candidates = {
"tags": ["Tax"],
"correspondents": [],
"document_types": [],
"storage_paths": [],
}
with (
patch(
"paperless_ai.ai_classifier.get_taxonomy_candidates",
return_value=fake_candidates,
),
patch(
"paperless_ai.ai_classifier.build_prompt_with_rag",
return_value="rag_prompt",
) as mock_rag,
patch("paperless_ai.client.AIClient.run_llm_query") as mock_llm,
):
mock_llm.return_value = {
"title": "",
"tags": [],
"correspondents": [],
"document_types": [],
"storage_paths": [],
"dates": [],
}
get_ai_document_classification(mock_doc, user)
mock_rag.assert_called_once_with(mock_doc, user, fake_candidates)
```
- [ ] **Step 2: Run to confirm they fail**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestGetAiDocumentClassificationCandidateWiring --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: `FAILED``get_ai_document_classification` doesn't pass candidates yet.
- [ ] **Step 3: Update `get_ai_document_classification` in `ai_classifier.py`**
Replace the existing `get_ai_document_classification`:
```python
def get_ai_document_classification(
document: Document,
user: User | None = None,
) -> dict:
ai_config = AIConfig()
candidates = get_taxonomy_candidates(user) if user is not None else None
prompt = (
build_prompt_with_rag(document, user, candidates)
if ai_config.llm_embedding_backend
else build_prompt_without_rag(document, candidates)
)
client = AIClient()
result = client.run_llm_query(prompt)
return parse_ai_response(result)
```
- [ ] **Step 4: Run Task 4 tests**
```bash
cd src && uv run pytest paperless_ai/tests/test_taxonomy_candidates.py::TestGetAiDocumentClassificationCandidateWiring --override-ini="addopts=" -v 2>&1 | tail -20
```
Expected: all 3 tests `PASSED`.
- [ ] **Step 5: Run the full `paperless_ai` test suite**
```bash
cd src && uv run pytest paperless_ai/tests/ --override-ini="addopts=" -v 2>&1 | tail -40
```
Expected: all tests `PASSED`.
- [ ] **Step 6: Commit**
```bash
git add src/paperless_ai/ai_classifier.py src/paperless_ai/tests/test_taxonomy_candidates.py
git commit -m "feat: wire taxonomy candidates into get_ai_document_classification"
```
---
### Task 5: Final verification
- [ ] **Step 1: Run the broader backend test suite to catch any regressions**
```bash
cd src && uv run pytest documents/tests/test_api_documents.py documents/tests/test_views.py paperless_ai/tests/ --override-ini="addopts=" -q 2>&1 | tail -20
```
Expected: all `PASSED`, no errors.
- [ ] **Step 2: Verify `ai_classifier.py` import order follows project conventions**
Project convention: stdlib → Django → third-party → local, alphabetical within each group. Open `src/paperless_ai/ai_classifier.py` and confirm the new imports (`Count`, model imports, `get_objects_for_user_owner_aware`) are placed in the correct groups in alphabetical order.
- [ ] **Step 3: Final commit if any formatting fixes were needed**
If Step 2 required changes:
```bash
git add src/paperless_ai/ai_classifier.py
git commit -m "chore: fix import ordering in ai_classifier.py"
```
File diff suppressed because it is too large Load Diff
@@ -0,0 +1,462 @@
# Unicode NFC Normalization for Filesystem Paths Implementation Plan
> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
**Goal:** Ensure all filesystem paths stored in the database and written to disk use NFC Unicode normalization, preventing "file not found" failures caused by byte-level mismatches between visually identical filenames (e.g., NFD `ü` = `u + combining diaeresis` vs NFC `ü` = single codepoint U+00FC).
**Architecture:** The fix has two layers. The primary fix normalizes the output of `clean_filepath()` in `FilePathTemplate.render()` — this is the single choke point through which all template-rendered filenames pass. Defense-in-depth changes normalize input strings before `pathvalidate.sanitize_filename()` in the context builder functions. A separate fix normalizes mail attachment filenames at the entry point. Existing documents with NFD paths will be transparently migrated to NFC on their next save (the file move logic already handles the case where old and new paths differ).
**Tech Stack:** Python `unicodedata.normalize('NFC', ...)`, `pathvalidate`, Django, Jinja2, pytest
---
## Background: The Bug
`pathvalidate.sanitize_filename()` removes illegal filesystem characters but does **not** normalize Unicode. NFC `ü` (UTF-8: `c3 bc`) and NFD `ü` (UTF-8: `75 cc 88`) are visually identical but produce different byte sequences. On Linux filesystems with no normalization (default ZFS, ext4), these are treated as distinct filenames. If an LLM or OCR engine produces NFD text for a document title, the generated filesystem path contains NFD bytes. If the same title is later regenerated in NFC form (LLM output is non-deterministic), the path lookup fails: `old_source_path.is_file()` returns `False` even though a file with the same visual name exists on disk.
## File Structure
| File | Change |
| ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `src/documents/templating/filepath.py` | Add NFC normalization in `clean_filepath()` (primary fix) + input normalization in `get_basic_metadata_context()`, `get_tags_context()`, `get_custom_fields_context()` (defense-in-depth) |
| `src/paperless_mail/mail.py` | Normalize attachment filenames before `pathvalidate.sanitize_filename()` |
| `src/documents/tests/test_file_handling.py` | Tests for NFC normalization in `generate_filename()` |
| `src/paperless_mail/tests/test_mail.py` | Tests for NFC normalization in mail attachment handling |
---
## Task 1: Normalize `clean_filepath()` output (primary fix)
This is the single choke point. ALL template-rendered paths pass through `clean_filepath()` before being stored in `document.filename`. Fixing this alone prevents the bug for every path generated via the filename format system — including `{{ title }}` (sanitized context), `{{ document.title }}` (raw context), `{{ correspondent }}`, and every other template variable.
**Files:**
- Modify: `src/documents/templating/filepath.py:36-48`
- Test: `src/documents/tests/test_file_handling.py`
- [ ] **Step 1: Write failing tests**
Add these tests to `src/documents/tests/test_file_handling.py`, inside `class TestFileHandling`:
```python
import unicodedata
@override_settings(FILENAME_FORMAT="{{ title }}")
def test_generate_filename_nfc_normalizes_nfd_title(self) -> None:
"""NFD title (u + combining diaeresis) must produce NFC path bytes."""
nfd_title = unicodedata.normalize("NFD", "Gemüse")
nfc_title = unicodedata.normalize("NFC", "Gemüse")
assert nfd_title != nfc_title # confirm inputs differ at byte level
doc = Document.objects.create(title=nfd_title, mime_type="application/pdf")
result = generate_filename(doc)
assert str(result) == f"{nfc_title}.pdf"
assert str(result).encode() == f"{nfc_title}.pdf".encode()
@override_settings(FILENAME_FORMAT="{{ correspondent }}/{{ title }}")
def test_generate_filename_nfc_normalizes_nfd_correspondent(self) -> None:
"""NFD correspondent name must produce NFC path component."""
nfd_name = unicodedata.normalize("NFD", "Müller")
nfc_name = unicodedata.normalize("NFC", "Müller")
correspondent = Correspondent.objects.create(name=nfd_name)
doc = Document.objects.create(
title="invoice",
correspondent=correspondent,
mime_type="application/pdf",
)
result = generate_filename(doc)
assert str(result) == f"{nfc_name}/invoice.pdf"
assert str(result).encode() == f"{nfc_name}/invoice.pdf".encode()
@override_settings(FILENAME_FORMAT="{{ document.title }}")
def test_generate_filename_nfc_normalizes_raw_document_title_in_template(self) -> None:
"""NFD title accessed via document.title (unsanitized context) must also be NFC."""
nfd_title = unicodedata.normalize("NFD", "Café")
nfc_title = unicodedata.normalize("NFC", "Café")
doc = Document.objects.create(title=nfd_title, mime_type="application/pdf")
result = generate_filename(doc)
assert str(result) == f"{nfc_title}.pdf"
assert str(result).encode() == f"{nfc_title}.pdf".encode()
```
- [ ] **Step 2: Run tests to verify they fail**
```bash
uv run pytest --override-ini="addopts=" src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_nfd_title src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_nfd_correspondent src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_raw_document_title_in_template -v
```
Expected: all three FAIL (NFD title produces NFD path, assertion fails).
- [ ] **Step 3: Add NFC normalization to `clean_filepath()`**
In `src/documents/templating/filepath.py`, add `import unicodedata` at the top of the file and modify `clean_filepath()`:
```python
import unicodedata # add to top-of-file imports
class FilePathTemplate(Template):
def render(self, *args, **kwargs) -> str:
def clean_filepath(value: str) -> str:
"""
Clean up a filepath by:
1. Normalizing to NFC Unicode form to prevent byte-level mismatches
between visually identical filenames on case-sensitive filesystems
2. Removing newlines and carriage returns
3. Removing extra spaces before and after forward slashes
4. Preserving spaces in other parts of the path
"""
value = unicodedata.normalize("NFC", value)
value = value.replace("\n", "").replace("\r", "")
value = re.sub(r"\s*/\s*", "/", value)
# We remove trailing and leading separators, as these are always relative paths, not absolute, even if the user
# tries
return value.strip().strip(os.sep)
original_render = super().render(*args, **kwargs)
return clean_filepath(original_render)
```
- [ ] **Step 4: Run tests to verify they pass**
```bash
uv run pytest --override-ini="addopts=" src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_nfd_title src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_nfd_correspondent src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_raw_document_title_in_template -v
```
Expected: all three PASS.
- [ ] **Step 5: Run the full file-handling test suite to check for regressions**
```bash
uv run pytest --override-ini="addopts=" src/documents/tests/test_file_handling.py -v
```
Expected: all existing tests continue to pass (ASCII titles are unaffected by NFC normalization).
- [ ] **Step 6: Commit**
```bash
git add src/documents/templating/filepath.py src/documents/tests/test_file_handling.py
git commit -m "Fix: normalize filesystem paths to NFC Unicode to prevent byte-level mismatches"
```
---
## Task 2: Defense-in-depth normalization in context builders
`clean_filepath()` (Task 1) fixes the rendered path. These changes normalize the input strings that go into `pathvalidate.sanitize_filename()` within the context builders — belt-and-suspenders so the sanitized shorthand variables (`{{ title }}`, `{{ correspondent }}`, `{{ tag_list }}`, `{{ custom_fields }}`) are also NFC before sanitization. This matters because the sanitized strings could theoretically be compared directly against DB-stored values in other contexts.
**Files:**
- Modify: `src/documents/templating/filepath.py:171-319`
- Test: `src/documents/tests/test_file_handling.py`
- [ ] **Step 1: Write failing tests**
Add these tests to `TestFileHandling` in `src/documents/tests/test_file_handling.py`:
```python
@override_settings(FILENAME_FORMAT="{{ tag_list }}/{{ title }}")
def test_generate_filename_nfc_normalizes_nfd_tag_list(self) -> None:
"""NFD tag names must produce NFC path component in tag_list."""
nfd_name = unicodedata.normalize("NFD", "Büro")
nfc_name = unicodedata.normalize("NFC", "Büro")
doc = Document.objects.create(title="doc", mime_type="application/pdf")
doc.tags.create(name=nfd_name)
result = generate_filename(doc)
assert str(result) == f"{nfc_name}/doc.pdf"
assert str(result).encode() == f"{nfc_name}/doc.pdf".encode()
```
- [ ] **Step 2: Run test to verify it fails**
```bash
uv run pytest --override-ini="addopts=" src/documents/tests/test_file_handling.py::TestFileHandling::test_generate_filename_nfc_normalizes_nfd_tag_list -v
```
Expected: FAIL. (The tag_list is already caught by `clean_filepath()` from Task 1, but we want a test that directly validates input normalization through the sanitize call.)
Note: this test may already pass after Task 1 due to `clean_filepath()`. If so, keep the test as a regression guard and move straight to the implementation.
- [ ] **Step 3: Normalize inputs in `get_basic_metadata_context()`**
In `src/documents/templating/filepath.py`, update `get_basic_metadata_context()`. The `unicodedata` import was added in Task 1.
```python
def get_basic_metadata_context(
document: Document,
*,
no_value_default: str = NO_VALUE_PLACEHOLDER,
) -> dict[str, str]:
"""
Given a Document, constructs some basic information about it. If certain values are not set,
they will be replaced with the no_value_default.
Regardless of set or not, the values will be sanitized
"""
return {
"title": pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", document.title),
replacement_text="-",
),
"correspondent": pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", document.correspondent.name),
replacement_text="-",
)
if document.correspondent
else no_value_default,
"document_type": pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", document.document_type.name),
replacement_text="-",
)
if document.document_type
else no_value_default,
"asn": str(document.archive_serial_number)
if document.archive_serial_number
else no_value_default,
"owner_username": document.owner.username
if document.owner
else no_value_default,
"original_name": PurePath(document.original_filename).with_suffix("").name
if document.original_filename
else no_value_default,
"doc_pk": f"{document.pk:07}",
}
```
- [ ] **Step 4: Normalize inputs in `get_tags_context()`**
Update `get_tags_context()` in the same file:
```python
def get_tags_context(tags: Iterable[Tag]) -> dict[str, str | list[str]]:
"""
Given an Iterable of tags, constructs some context from them for usage
"""
return {
"tag_list": pathvalidate.sanitize_filename(
",".join(
sorted(unicodedata.normalize("NFC", tag.name) for tag in tags),
),
replacement_text="-",
),
# Assumed to be ordered, but a template could loop through to find what they want
"tag_name_list": [unicodedata.normalize("NFC", x.name) for x in tags],
}
```
- [ ] **Step 5: Normalize string-type inputs in `get_custom_fields_context()`**
Update `get_custom_fields_context()` in the same file. Only string-type fields (MONETARY, STRING, URL, LONG_TEXT, SELECT) go through `sanitize_filename()`; the others (dates, numbers, booleans) cannot contain non-ASCII unicode. Also normalize the field name itself.
```python
def get_custom_fields_context(
custom_fields: Iterable[CustomFieldInstance],
) -> dict[str, dict[str, dict[str, str]]]:
"""
Given an Iterable of CustomFieldInstance, builds a dictionary mapping the field name
to its type and value
"""
field_data = {"custom_fields": {}}
for field_instance in custom_fields:
type_ = pathvalidate.sanitize_filename(
field_instance.field.data_type,
replacement_text="-",
)
if field_instance.value is None:
value = None
# String types need to be sanitized
elif field_instance.field.data_type in {
CustomField.FieldDataType.MONETARY,
CustomField.FieldDataType.STRING,
CustomField.FieldDataType.URL,
CustomField.FieldDataType.LONG_TEXT,
}:
value = pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", field_instance.value),
replacement_text="-",
)
elif (
field_instance.field.data_type == CustomField.FieldDataType.SELECT
and field_instance.field.extra_data["select_options"] is not None
):
options = field_instance.field.extra_data["select_options"]
value = pathvalidate.sanitize_filename(
unicodedata.normalize(
"NFC",
next(
option["label"]
for option in options
if option["id"] == field_instance.value
),
),
replacement_text="-",
)
else:
value = field_instance.value
field_data["custom_fields"][
pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", field_instance.field.name),
replacement_text="-",
)
] = {
"type": type_,
"value": value,
}
return field_data
```
- [ ] **Step 6: Run the new test and full test suite**
```bash
uv run pytest --override-ini="addopts=" src/documents/tests/test_file_handling.py -v
```
Expected: all tests pass, including the new tag test.
- [ ] **Step 7: Commit**
```bash
git add src/documents/templating/filepath.py src/documents/tests/test_file_handling.py
git commit -m "Fix: normalize context builder inputs to NFC before sanitize_filename (defense-in-depth)"
```
---
## Task 3: Normalize mail attachment filenames
Email attachment filenames come from MIME headers and can be in any Unicode normalization depending on the sending client. These flow into `document.original_filename` and then into `{{ original_name }}` template context. They also become the temp file name created on disk.
**Files:**
- Modify: `src/paperless_mail/mail.py`
- Test: `src/paperless_mail/tests/test_mail.py`
- [ ] **Step 1: Find the exact lines in mail.py**
```bash
grep -n "sanitize_filename" src/paperless_mail/mail.py
```
Expected output (line numbers may vary):
```
NNN: attachment_name = pathvalidate.sanitize_filename(att.filename)
NNN: filename=pathvalidate.sanitize_filename(att.filename),
NNN: filename=pathvalidate.sanitize_filename(f"{message.subject}.eml"),
```
Note the line numbers for the next step.
- [ ] **Step 2: Write a failing test**
Find an existing test in `src/paperless_mail/tests/test_mail.py` that exercises attachment filename handling (search for `sanitize_filename` or `att.filename` in that file to find a good base test to copy). Add a new test that uses an NFD attachment filename.
The following test goes into the appropriate `TestCase` class in `src/paperless_mail/tests/test_mail.py`. Look at the file first to confirm the right class and mock patterns — the test below follows the existing pattern for mocking `MailMessage` and `Attachment` objects:
```python
def test_attachment_filename_nfd_normalized_to_nfc(self) -> None:
"""Mail attachment filenames with NFD encoding must be normalized to NFC."""
import unicodedata
nfd_name = unicodedata.normalize("NFD", "Rechnung März.pdf")
nfc_name = unicodedata.normalize("NFC", "Rechnung März.pdf")
assert nfd_name != nfc_name # confirm inputs differ at byte level
# Use whatever mock/factory pattern exists in this test file for creating
# a fake attachment with a specific filename, then run the mail handler,
# and assert that document.original_filename == nfc_name (not nfd_name).
# Adapt the mock setup to match the test file's existing patterns exactly.
```
To find the right mock pattern: `grep -n "att.filename\|Attachment\|MailMessage\|MagicMock" src/paperless_mail/tests/test_mail.py | head -20`
- [ ] **Step 3: Run the test to verify it fails**
```bash
uv run pytest --override-ini="addopts=" src/paperless_mail/tests/test_mail.py -k "test_attachment_filename_nfd" -v
```
Expected: FAIL.
- [ ] **Step 4: Add `import unicodedata` to mail.py**
At the top of `src/paperless_mail/mail.py`, add:
```python
import unicodedata
```
- [ ] **Step 5: Normalize attachment filenames in mail.py**
At each of the three `pathvalidate.sanitize_filename` call sites found in Step 1, wrap the input string with `unicodedata.normalize("NFC", ...)`:
For the attachment temp file creation:
```python
attachment_name = pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", att.filename)
)
```
For the metadata override filename:
```python
filename=pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", att.filename)
),
```
For the EML subject filename:
```python
filename=pathvalidate.sanitize_filename(
unicodedata.normalize("NFC", f"{message.subject}.eml")
),
```
- [ ] **Step 6: Run the mail test suite**
```bash
uv run pytest --override-ini="addopts=" src/paperless_mail/tests/test_mail.py -v
```
Expected: all tests pass, including the new NFD normalization test.
- [ ] **Step 7: Commit**
```bash
git add src/paperless_mail/mail.py src/paperless_mail/tests/test_mail.py
git commit -m "Fix: normalize mail attachment filenames to NFC Unicode"
```
---
## Self-Review Checklist
### Spec coverage
| Requirement | Covered by |
| --------------------------------------------------------- | ----------------------------------------------------- |
| `clean_filepath()` normalizes all template-rendered paths | Task 1 Step 3 |
| `{{ title }}` (sanitized context) produces NFC output | Task 1 test + Task 2 Step 3 |
| `{{ document.title }}` (raw context) produces NFC output | Task 1 test |
| `{{ correspondent }}` produces NFC output | Task 1 test + Task 2 Step 3 |
| `{{ tag_list }}` and `tag_name_list` produce NFC output | Task 2 Steps 1+4 |
| Custom field string values produce NFC output | Task 2 Step 5 |
| Mail attachment filenames normalized at entry point | Task 3 |
| Existing NFD files auto-migrate to NFC on next save | Handled by existing move logic; no code change needed |
### Notes for implementer
- The `FILENAME_FORMAT` setting accepts old-style `{title}` format strings, which `convert_format_str_to_template_format()` converts to Jinja2 `{{ title }}` before rendering. Tests using `@override_settings(FILENAME_FORMAT="{{ title }}")` use Jinja2 syntax directly.
- Run tests with `--override-ini="addopts="` to disable coverage and parallelism for faster iteration.
- The `unicodedata` module is part of the Python standard library — no new dependency.
- NFC is the right normalization form for filenames: it is the default on macOS (HFS+/APFS) and the form most databases and text processing tools produce. NFD is what macOS HFS+ _internally_ normalizes to when writing (but presents as NFC), and what some OCR/LLM outputs occasionally produce.
@@ -0,0 +1,225 @@
# Scheduled Backup Design
**Date**: 2026-05-15
**Status**: Approved
## Overview
Add a scheduled backup system to paperless-ngx that exports documents as zip files on a user-configurable schedule, retaining the last N backups. The schedule timing is configured via an env var (consistent with all other scheduled tasks), while the backup-specific configuration (output directory, keep count) lives in a new database model editable through the API and UI.
## Goals
- Automated periodic exports without manual intervention
- Zip-based output for simple, unambiguous rotation
- Opt-in: no backup runs unless explicitly configured
- Strongly typed export contract usable by both the CLI and the scheduled task
- UI-editable backup config, no additional env vars beyond the cron schedule
## Non-Goals
- Encrypted backups (future enhancement)
- Age-based or size-based rotation (count-only for now)
- Remote/cloud backup destinations
- Import automation
---
## Section 1: Data Model and API
### `BackupConfiguration` model
New singleton model in `src/paperless/models.py`, following the same `AbstractSingletonModel` pattern as `ApplicationConfiguration`.
```python
class BackupConfiguration(AbstractSingletonModel):
output_dir = models.CharField(
verbose_name=_("Backup output directory"),
max_length=1024,
blank=True,
default="",
)
keep_count = models.PositiveIntegerField(
verbose_name=_("Number of backups to keep"),
default=5,
help_text=_("Set to 0 to keep all backups."),
)
class Meta:
verbose_name = _("Backup configuration")
```
- `output_dir` blank/empty means backup is disabled (the task treats it as a no-op).
- `output_dir` must be an absolute path. The serializer validates this via a custom validator; `run_export` also calls `.resolve()` on the path unconditionally.
- `keep_count = 0` means keep all backups; no rotation is performed.
### Migration
The migration is created in `src/paperless/migrations/` (not `src/documents/migrations/`), since `BackupConfiguration` lives in the `paperless` app.
### API
- **Serializer**: `BackupConfigurationSerializer` in `src/paperless/serialisers.py`
- **ViewSet**: `BackupConfigurationViewSet` in `src/paperless/views.py` — singleton GET/PATCH, same pattern as `ApplicationConfiguration`
- **Route**: `/api/backup_config/` registered in `src/paperless/urls.py`
---
## Section 2: Export Module
New module `src/documents/export.py` contains the export contract and core logic, extracted from `document_exporter`'s `handle()` method.
### `ExportOptions` dataclass
```python
@dataclass
class ExportOptions:
target: Path
compare_checksums: bool = False
compare_json: bool = False
delete: bool = False
use_filename_format: bool = False
no_archive: bool = False
no_thumbnail: bool = False
use_folder_prefix: bool = False
split_manifest: bool = False
zip_export: bool = False
zip_name: str | None = None # None -> default date-based name
data_only: bool = False
passphrase: str | None = None
batch_size: int = 500
```
`zip_name = None` means the caller wants the default date-based name. `run_export` resolves `None` internally to `f"export-{timezone.localdate().isoformat()}"` before use — callers never need to supply a default. The scheduled task always passes an explicit timestamped name.
### `run_export(options: ExportOptions) -> None`
The body of the current `Command.handle()` in `document_exporter` moves here, reading from `ExportOptions` instead of parsed CLI options. No behaviour changes.
### Refactored `document_exporter` management command
Becomes a thin CLI adapter:
1. Parse arguments (unchanged)
2. Construct `ExportOptions` from parsed args
3. Call `run_export(options)`
---
## Section 3: Scheduled Task and Rotation
### `scheduled_backup` task in `src/documents/tasks.py`
```
1. Load BackupConfiguration (singleton)
2. If output_dir is blank, log a debug message and return (no-op, no PaperlessTask created)
3. Create a PaperlessTask record (TriggerSource.SCHEDULED) to track this run
4. Build zip_name as local-time timestamp: "export-YYYY-MM-DD-HHMMSS"
using Django's timezone.localtime()
5. Construct ExportOptions(
target=Path(config.output_dir),
zip_export=True,
zip_name=zip_name,
)
6. Call run_export(options)
7. If keep_count > 0:
zips = sorted(Path(config.output_dir).glob("export-*.zip"), key=lambda p: p.stat().st_mtime)
for old_zip in zips[:-keep_count]:
old_zip.unlink()
8. Mark PaperlessTask as complete (handled by signal handlers)
```
Key design notes:
- Rotation uses `export-*.zip` glob, not `*.zip`, to avoid matching zip files in the directory that paperless did not create.
- Rotation occurs only after a successful export, so a failed run does not consume a rotation slot.
- The timestamp format `YYYY-MM-DD-HHMMSS` in local time ensures multiple runs per day produce distinct filenames without collision.
### PaperlessTask integration
`PaperlessTask` lifecycle is managed entirely by the Celery signal handlers in `src/documents/signals/handlers.py`, not manually inside the task body.
**Changes to `TRACKED_TASKS` and `PaperlessTask.TaskType`:**
- Add `PaperlessTask.TaskType.BACKUP` to the `TaskType` enum in `src/documents/models.py`
- Add `"documents.tasks.scheduled_backup": PaperlessTask.TaskType.BACKUP` to `TRACKED_TASKS`
**Conditional tracking — the no-op case:**
When `BackupConfiguration.output_dir` is blank the task returns immediately, so no record should appear in the Tasks panel. This requires explicit handling in all five signal handlers. Relying on incidental safety (filters that match 0 rows, `DoesNotExist` guards) is fragile and unclear to future maintainers.
The approach for each handler when the task type is `BACKUP`:
| Handler | Current behaviour when no record exists | Required change |
| ----------------------------- | ---------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------- |
| `before_task_publish_handler` | Creates the record | Check `BackupConfiguration.get_solo().output_dir`; skip `PaperlessTask.objects.create()` if blank |
| `task_prerun_handler` | `.filter().update()` — silent no-op | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_postrun_handler` | `DoesNotExist: return` — incidentally safe | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_failure_handler` | `.filter().first()` returns `None`, update skipped — incidentally safe | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
| `task_revoked_handler` | `.filter().update()` — silent no-op | Add explicit early return if `BACKUP` task type and no record exists for `task_id` |
Extract a helper `_backup_task_is_tracked(task_id: str) -> bool` that returns `PaperlessTask.objects.filter(task_id=task_id).exists()`. The four downstream handlers call this after the `TRACKED_TASKS` check and return early if it returns `False` for a `BACKUP` task. This makes the intent explicit: "this task was intentionally not tracked for this invocation."
---
## Section 4: Beat Schedule
Add to the task list in `parse_beat_schedule()` in `src/paperless/settings/custom.py`:
```python
{
"name": "Scheduled document backup",
"env_key": "PAPERLESS_EXPORT_TASK_CRON",
"env_default": "disable",
"task": "documents.tasks.scheduled_backup",
"options": {
"expires": 1.0 * 60.0 * 60.0, # 1 hour
},
},
```
- Default is `"disable"` — the task is not added to the beat schedule unless the env var is explicitly set.
- Setting `PAPERLESS_EXPORT_TASK_CRON=disable` (or simply not setting it) produces no scheduled task and no noise.
- Typical user value: `"0 2 * * *"` (daily at 02:00 local server time).
- `expires` is set to 1 hour: if a scheduled backup has not started within 1 hour of its trigger time (e.g., the Celery worker was down), it is discarded rather than running late. Unlike other tasks whose expiry is tied to a known default interval, this task has a user-defined schedule. 1 hour is a conservative value that prevents stale backup tasks from piling up without being so short that it causes problems on a normally-running worker.
---
## Section 5: Frontend
Location to be decided by co-maintainer (dedicated "Backup" page vs. section within Application Settings). The API contract is independent of this decision.
The UI requires two fields:
- **Output directory** — text input for `output_dir` (absolute path on the server)
- **Keep count** — number input for `keep_count`, with a note that 0 means keep all
The component performs a GET to `/api/backup_config/` on load and a PATCH on save, identical to how the Application Settings component works.
---
## File Change Summary
| File | Change |
| -------------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| `src/paperless/models.py` | Add `BackupConfiguration` model |
| `src/paperless/serialisers.py` | Add `BackupConfigurationSerializer` |
| `src/paperless/views.py` | Add `BackupConfigurationViewSet` |
| `src/paperless/urls.py` | Register `/api/backup_config/` route |
| `src/paperless/settings/custom.py` | Add `PAPERLESS_EXPORT_TASK_CRON` beat entry |
| `src/documents/export.py` | New module: `ExportOptions`, `run_export()` |
| `src/documents/management/commands/document_exporter.py` | Thin wrapper around `run_export()` |
| `src/documents/models.py` | Add `PaperlessTask.TaskType.BACKUP` |
| `src/documents/signals/handlers.py` | Add `BACKUP` to `TRACKED_TASKS`; add `_backup_task_is_tracked()`; update all 5 signal handlers |
| `src/documents/tasks.py` | Add `scheduled_backup` task |
| `src-ui/` | New or extended settings component (location TBD) |
| `src/paperless/migrations/` | New migration for `BackupConfiguration` |
---
## Testing
- **`src/paperless/tests/test_backup_config.py`** — model, serializer, API (GET/PATCH)
- **`src/documents/tests/test_export.py`** — new unit tests for `run_export()` directly; `test_management_exporter.py` retains its existing CLI wiring tests and gains tests for the thin-wrapper behaviour
- **`src/documents/tests/test_tasks_backup.py`** — `scheduled_backup` task: no-op when `output_dir` blank, export called with correct options, rotation deletes correct files, rotation skipped when `keep_count=0`
- **`src/documents/tests/test_task_signals.py`** — signal handler behaviour for `BACKUP` task type: no record created when `output_dir` blank, all downstream handlers skip cleanly when no record exists, normal lifecycle when `output_dir` is set
- Frontend unit tests for the settings component
@@ -0,0 +1,81 @@
# Interactive Shell Container Environment
**Date:** 2026-05-26
**Branch:** fix-tanvity-index-lock (to be implemented on a new branch)
**Status:** Approved
## Problem
When paperless-ngx users open an interactive shell in the running container via `docker exec -it <container> bash`, they do not see environment variables resolved from `*_FILE` secret injection.
The `init-env-file` s6 init script reads `PAPERLESS_*_FILE` variables (e.g. `PAPERLESS_SECRET_KEY_FILE=/run/secrets/key`), reads the referenced file, and writes the resolved value (e.g. `PAPERLESS_SECRET_KEY=abc123`) to `/run/s6/container_environment/`. All s6-managed services and management command wrappers use the `#!/command/with-contenv` shebang, which reads that directory and injects all vars into the process environment before execution.
`docker exec bash` bypasses s6 entirely. It is a non-login interactive shell launched directly by the Docker daemon, which provides only the original Docker-configured environment (the `*_FILE` paths, not the resolved values). Any manual command a user runs — such as `document_exporter` or `manage.py` calls — will be missing the resolved secrets unless they happen to also be set as plain Docker env vars.
## Approach
Source `/run/s6/container_environment/` in every interactive bash shell opened in the container, mirroring what `with-contenv` does for s6 services.
Two hooks are needed because Debian uses different rc files for different shell types:
- **Non-login interactive** (`docker exec bash`): sources `/etc/bash.bashrc`
- **Login interactive** (`docker exec bash --login`): sources `/etc/profile`, which auto-sources all `/etc/profile.d/*.sh`
## Changes
### 1. `docker/rootfs/etc/profile.d/contenv.sh` (new file)
A POSIX-compatible shell script that exports all files in `/run/s6/container_environment/` as environment variables. Placed here so login shells pick it up automatically.
```sh
#!/bin/sh
# Source s6 container environment for interactive shells.
# Ensures variables resolved from *_FILE secret injection are visible
# when using 'docker exec bash'. Does not affect s6 services (those
# use with-contenv directly). Has no effect in non-container contexts
# because the directory will not exist.
# Note: sh/dash shells opened via 'docker exec sh' are not covered;
# only bash-based sessions benefit from this file.
_pngx_contenv="/run/s6/container_environment"
if [ -d "${_pngx_contenv}" ]; then
for _pngx_f in "${_pngx_contenv}"/*; do
[ -f "${_pngx_f}" ] || continue
_pngx_name=$(basename "${_pngx_f}")
_pngx_val=$(cat "${_pngx_f}")
export "${_pngx_name}=${_pngx_val}"
done
fi
unset _pngx_contenv _pngx_f _pngx_name _pngx_val
```
### 2. Dockerfile `main-app` stage (one line added)
Appends a source line to `/etc/bash.bashrc` so non-login interactive shells also pick up contenv. Added after the runtime package installation block, before the Python dependency installation.
```dockerfile
RUN echo '. /etc/profile.d/contenv.sh' >> /etc/bash.bashrc
```
`/etc/bash.bashrc` is provided by the Debian base image and installed during the apt step, so it exists by the time this `RUN` executes.
## Coverage
| How user gets a shell | Gets contenv? | Mechanism |
| ---------------------------------------- | --------------------- | ---------------------------------------- |
| `docker exec -it container bash` | Yes | `/etc/bash.bashrc` sources `contenv.sh` |
| `docker exec -it container bash --login` | Yes | `/etc/profile.d/contenv.sh` auto-sourced |
| `docker exec -it container sh` | No (known limitation) | `sh` sources neither file |
| Management command wrappers | Already worked | `with-contenv` shebang |
| s6 services | Already worked | `with-contenv` shebang |
## Edge Cases
**Shell opened before `init-env-file` completes:** The directory exists but may not yet contain all resolved vars. The script exports what is present; missing vars are simply absent. No error is produced.
**Variable value contains special characters:** `$(cat file)` strips only trailing newlines (which `init-env-file` already warns about). Other special characters are preserved correctly by the `export "NAME=VALUE"` form.
**Directory does not exist (non-container use):** The `[ -d ]` guard makes the script a no-op. Safe to include in any Debian-based image.
## Testing
No automated test is added. This is container-bootstrap shell plumbing with no Python code path. Manual verification: run the container with a `*_FILE` secret, `docker exec bash`, and confirm the resolved variable is present in the environment.
@@ -0,0 +1,138 @@
# LLM Index Schema Migrations (second spec)
Date: 2026-06-10
Depends on: `docs/superpowers/specs/2026-06-10-sqlite-vec-vector-store-design.md` and its implementation plan (`docs/superpowers/plans/2026-06-10-sqlite-vec-transition.md`). This spec layers on top of the completed sqlite-vec transition; do not start it before that branch lands.
Supersedes: PR #12968 (in-place LanceDB migrations). The machinery design there is carried over nearly verbatim; only the storage backend specifics change. #12968 should be closed with a pointer here once this ships.
Scope update (user decision, 2026-06-10): the `embedding.py:115` metadata restructure originally drafted as Part 2 of this spec was folded into the transition plan instead (its Task 5), because the transition forces a full rebuild anyway, so the embedded-text change rides along with no extra re-embed cost. This spec is now machinery-only: it ships with an EMPTY migration registry, ready for whatever schema change comes next. Part 2 below is retained as the worked example of how a re-embed migration would be registered, since the next one will not have a free rebuild to piggyback on.
## Part 1: Schema migration machinery (ported from PR #12968)
### What carries over unchanged
The PR's design survives the store swap intact and is adopted as-is:
- `Migration` frozen dataclass: `version: int`, `description: str`, `requires_reembed: bool`, `apply: Callable` (compare/hash-excluded field).
- `MIGRATIONS: list[Migration]` ordered registry + `CURRENT_SCHEMA_VERSION: Final[int]` in `vector_store.py`. To add a migration: bump the constant, append an entry.
- Store surface: `stored_schema_version() -> int` (0 when unrecorded, so pre-versioning tables treat every migration as pending), `pending_migrations()`, `requires_reembed_migration()`, `apply_structural_migrations() -> list[Migration]`.
- The stop-at-first-reembed-boundary rule in `apply_structural_migrations()`: structural migrations are applied in version order only up to the first pending `requires_reembed=True` entry, so the version counter can never jump past a re-embed boundary and silently skip the rebuild. (This was the subtle correctness insight of #12968; preserve the comment.)
- The `update_llm_index()` hook, verbatim from the PR:
```python
with write_store(embed_model_name=model_name) as store:
if not rebuild and store.table_exists():
store.apply_structural_migrations()
if store.requires_reembed_migration():
logger.warning(
"Schema migration requires re-embedding; forcing LLM index rebuild.",
)
rebuild = True
```
- Test approach from the PR: mock `MIGRATIONS`/`CURRENT_SCHEMA_VERSION` with `mocker.patch`, spy on `drop_table` to distinguish in-place from rebuild, one test per path (structural applied without rebuild; pending re-embed forces rebuild).
### What changes for sqlite-vec
**1. Version storage: `index_meta['schema_version']` instead of `schema_version.json`.**
The Lance store needed a sidecar JSON file because Lance had no convenient mutable metadata. The sqlite-vec store already has the `index_meta` key/value table, which is transactional with the data itself (a migration and its version bump commit atomically, which the file never could). Concretely:
- `_create_table(dim)` additionally writes `schema_version = str(CURRENT_SCHEMA_VERSION)` (fresh tables are always current).
- `stored_schema_version()` reads the meta key, returns 0 on absence/garbage.
- `drop_table()` already does `DELETE FROM index_meta`, which clears the version with it. No sidecar file, no unlink bookkeeping.
- `apply_structural_migrations()` writes the new version inside the same transaction as the last applied migration.
**2. `apply` receives the store, not a table handle.**
Lance migrations got the raw table for `add_columns`/`alter_columns`. vec0 virtual tables do not support arbitrary `ALTER TABLE`, so structural migrations are SQL against the store's connection. Signature: `apply: Callable[[PaperlessSqliteVecVectorStore], None]`. The store exposes what migrations need: `.client` (connection), `._table_name`, `.vector_dim()`, and the rebuild helper below.
**3. Structural migrations are create+copy+rename, sharing the compact() machinery.**
The sqlite-vec `compact()` already implements the only structural mutation vec0 supports: build a new table, `INSERT INTO ... SELECT` (vectors copied bit-for-bit, no re-embedding), drop old, rename. Factor it into a shared helper on the store:
```python
def rebuild_table(
self,
*,
create_sql: str | None = None,
copy_select: str | None = None,
) -> None:
"""Copy live rows into a freshly created table and swap it in.
Defaults reproduce the current schema (compaction). Structural
migrations pass a modified CREATE statement and a matching SELECT
(e.g. adding a column with a literal default). Runs in one
transaction; VACUUM afterwards.
"""
```
`compact()` becomes a thin caller (threshold check + `rebuild_table()`), and a structural migration like "add a `+page_count` aux column" is:
```python
Migration(
version=2,
description="add page_count auxiliary column",
requires_reembed=False,
apply=lambda store: store.rebuild_table(
create_sql=..., # CREATE VIRTUAL TABLE ... with the new column
copy_select="SELECT id, document_id, modified, node_content, embedding, '' FROM {old}",
),
)
```
A pleasant consequence: every structural migration is also a compaction (the copy drops dead rows), and the file-format risk surface is one helper with one test suite instead of two code paths.
**4. Bootstrap version for the sqlite-vec store is 1.**
The transition plan ships the new store without machinery; tables it creates carry no `schema_version` key and therefore read as 0. This release lands with `CURRENT_SCHEMA_VERSION = 1` and `MIGRATIONS = []`, so the bootstrap is unconditionally safe: a 0-version table has no pending migrations and `apply_structural_migrations()` simply stamps it to 1. (The metadata restructure having moved into the transition itself is what makes this clean; the registry's first real entry will be v2, written against tables that are all stamped.)
## Part 2 (worked example, IMPLEMENTED IN THE TRANSITION): the metadata TODO as a re-embed migration
This section was implemented as Task 5 of the transition plan and ships with the store swap, not with this spec. It is kept as the reference example of how to register the next re-embed migration.
### The change
`build_llm_index_text()` currently embeds three short structured values in the body text:
```python
f"Filename: {doc.filename}",
f"Storage Path: {doc.storage_path.name if doc.storage_path else ''}",
f"Archive Serial Number: {doc.archive_serial_number or ''}",
```
Per the TODO, move them to `node.metadata` (excluded from embeddings, visible to the LLM via llama-index's metadata prepend), the same treatment title/tags/correspondent/document_type got in PR #12944. Notes and Custom Fields stay in the body (long free text / dynamic count, as the TODO says).
1. `embedding.py build_llm_index_text()`: delete the three lines above (the `lines` list keeps Notes, Custom Fields, and Content). Update the TODO comment to describe only what remains intentional (Notes/Custom Fields stay embedded), or delete it.
2. `indexing.py build_document_node()` metadata dict gains:
```python
"filename": doc.filename,
"storage_path": document.storage_path.name if document.storage_path else None,
"archive_serial_number": document.archive_serial_number,
```
(`None`/int values are fine here: this dict lives in the node-content JSON, not in vec0 metadata columns; only `document_id`/`modified` are columns with the NULL restriction. Matches the existing convention of `correspondent: None`.) 3. `excluded_embed_metadata_keys=list(metadata.keys())` already covers the new keys; `excluded_llm_metadata_keys` stays `["document_id"]` so the LLM sees the new fields.
### Why this class of change needs a migration
Removing the three lines changes the embedded text of every document, so stored vectors no longer match what the current code would embed. Incremental updates only re-embed documents whose `modified` changed, so without a forced rebuild the index would be a mixed old/new-text population indefinitely. This particular change escaped that fate only because the transition's forced rebuild covers it. The next embedded-text change will not have that luxury and gets registered like this:
```python
CURRENT_SCHEMA_VERSION: Final[int] = 2
MIGRATIONS: list[Migration] = [
Migration(
version=2,
description="<what changed about the embedded text>",
requires_reembed=True,
apply=lambda store: None,
),
]
```
On the first `update_llm_index` after upgrade, the hook sees the pending re-embed migration, logs, and rebuilds.
### Test plan
Machinery only (the metadata change is tested in the transition plan's Task 5). Port of the #12968 tests, dedicated file `test_vector_store_migrations.py`: structural migration applies in-place without `drop_table`; pending re-embed forces rebuild; version stamping on create/drop; bootstrap stamping of a pre-machinery 0-version table to 1; stop-at-boundary with a mixed [structural v2, reembed v3, structural v4] registry asserting v4 is NOT applied and the stored version stays at 2; `rebuild_table()` round-trips rows byte-for-byte (shared with compact tests).
### Open questions
- PR #12968 disposition: close with a comment pointing at this spec once the machinery lands (the Lance-specific `add_columns` path has no successor; vec0 cannot do in-place column adds).
- `created`/`added` fields are also candidates for future structural metadata work, but nothing needs them now (YAGNI; noted only so the next reader does not re-derive it).
@@ -0,0 +1,155 @@
# sqlite-vec Vector Store Design (replaces PaperlessLanceVectorStore)
Date: 2026-06-10
Context: LanceDB wheels SIGILL on non-AVX2 CPUs (#12970); research in `2026-06-10-vector-store-alternatives-research.md` selected sqlite-vec. This is a beta feature, so a one-time re-embed on upgrade is acceptable. Every claim marked [VERIFIED] below was empirically tested against the actual PyPI wheel (0.1.9, and 0.1.10a4 where noted), either in this repo's scratch harness (`/tmp/vstore-avx-test/explore_sqlitevec*.py`) or by the issues-audit agent.
## Version pin: `sqlite-vec==0.1.9`, and why it is load-bearing
- The 0.1.9 linux x86_64 wheel is built with **no SIMD flags at all** (`vec_debug()` shows empty build flags) and passed our qemu Westmere (SSE4.2, no AVX) and SandyBridge (AVX, no AVX2) emulation tests [VERIFIED]. This is the entire point of the migration.
- The **0.1.10-alpha.4 wheel regresses this**: built with `-mavx -DSQLITE_VEC_ENABLE_AVX` file-wide, no runtime CPU dispatch. It can SIGILL on AVX-less CPUs, including Goldmont Atom/Celeron NAS boxes, exactly the #12970 user base [VERIFIED via vec_debug on the wheel].
- Guardrails: pin `==0.1.9` exactly; log `SELECT vec_version(), vec_debug()` at store init as an AVX canary; before ever bumping to 0.1.10+, re-check the wheel flags (and consider raising the runtime-dispatch issue upstream first).
- arm64: 0.1.9 manylinux aarch64 wheel is a proper ELF64 binary, no NEON flags baked [VERIFIED]. (The broken 32-bit "aarch64" wheel era was 0.1.6, fixed since.)
- No sdist on PyPI (asg017/sqlite-vec#211, open) and no musl wheels; fine for our Debian-based image, blocks Alpine bare-metal installs.
## Schema
One dedicated SQLite database file in `LLM_INDEX_DIR` (e.g. `llmindex.db`), never the Django DB. Connections set `PRAGMA journal_mode=WAL`, `busy_timeout`, `synchronous=NORMAL`.
```sql
CREATE VIRTUAL TABLE nodes USING vec0(
id TEXT PRIMARY KEY, -- node_id (uuid)
document_id TEXT, -- METADATA column, deliberately NOT a partition key
modified TEXT, -- ISO timestamp; never NULL (sentinel "")
+node_content TEXT, -- auxiliary column: JSON payload, any size
embedding float[{dim}] distance_metric=cosine
);
CREATE TABLE IF NOT EXISTS index_meta (key TEXT PRIMARY KEY, value TEXT);
-- rows: embed_model, dim, schema_version, created_by_vec_version
```
Design decisions, each verified on 0.1.9:
- **`document_id` is a metadata column, not a partition key.** With a partition key, `k` applies per partition: `k=5 AND document_id IN (3 docs)` returns 15 rows (asg017/sqlite-vec#142, open) [VERIFIED]. As a metadata column the same query returns a correct global top-k of exactly 5 [VERIFIED]. `query_similar_documents()` passes permission-scoped `IN` lists, so per-partition semantics would over-fetch k x N(docs). At our scale the partition-pruning speedup is not needed (filtered KNN at 20K x 1024 was _faster_ than unfiltered: 39 ms vs 74 ms).
- **One document column, not two.** The Lance store carried both `doc_id` (ref_doc_id) and `document_id`; in our usage they are always the same value (`str(document.id)`), so the new schema keeps only `document_id`.
- **TEXT primary key works** (insert, UPDATE, DELETE, duplicate rejection) [VERIFIED]. There is no usable rowid mapping with a TEXT pk, which we do not need.
- **Aux column for the payload.** `+node_content` holds the multi-KB JSON; aux columns cannot appear in KNN WHERE clauses (loud error, not silent) [VERIFIED], which we never do, and are selectable in scans and KNN results [VERIFIED].
- **Metadata columns reject NULL** (asg017/sqlite-vec#141, open) [VERIFIED]. `_row()` must keep coercing everything through `str(... or "")` as it already does today.
- **`distance_metric=cosine`**: similarity maps as `1 - distance` (identical vector gives distance 0.0 [VERIFIED]). For unit-norm embeddings the ranking equals today's L2 ranking; for non-normalized models cosine is the safer default, and the beta re-embed makes the behavior change free. (L2 + `1/(1+d)` remains available if exact parity is ever wanted.)
- **Vectors are always bound as float32 BLOBs** (`struct.pack`/`np.tobytes`), never JSON text: bypasses the locale-dependent `strtod` parsing bug (asg017/sqlite-vec#241, open) entirely.
- Limits, all comfortable: dims <= 8192, k <= 4096, chunk_size default 1024 [VERIFIED]. TEXT metadata has no length cap; values > 12 bytes go to a shadow text table with a prefix fast-path, and the one historical bug at that boundary (long-metadata DELETE, #274) is fixed in 0.1.9.
## Method mapping (PaperlessLanceVectorStore -> PaperlessSqliteVecVectorStore)
| Current method | sqlite-vec implementation | Notes |
| --------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `__init__(uri, table_name, embed_model_name)` | `sqlite3.connect(path)` + `enable_load_extension` + `sqlite_vec.load()` + PRAGMAs | Same lazy "table may not exist yet" stance |
| `client` property | the `sqlite3.Connection` | |
| `table_exists()` | `SELECT 1 FROM sqlite_master WHERE name='nodes'` | |
| `vector_dim()` | `index_meta['dim']` | Written at table creation; wrong-dim inserts are rejected by vec0 anyway [VERIFIED] |
| `drop_table()` | `DROP TABLE nodes` | Drops all 7 shadow tables with it [VERIFIED]; also clear `index_meta` |
| `stored_model_name()` / `config_mismatch()` | `index_meta['embed_model']` | Same conservative None handling |
| `_schema(dim, model)` | the CREATE statements above | dim from first batch, as today (`_ensure_table`) |
| `_row(node)` | same dict, vector packed to bytes | keep `str(... or "")` coercion (NULL rejection) |
| `add(nodes)` | `executemany(INSERT ...)` inside one transaction | ~3,300 rows/s at 1024 dims measured; batching via transactions |
| `upsert_document(document_id, nodes)` | `BEGIN; DELETE FROM nodes WHERE document_id = ?; executemany(INSERT); COMMIT` | **Not** `INSERT OR REPLACE`: broken on vec0 (asg017/sqlite-vec#259, open). Transaction gives the same no-transient-empty-state guarantee as merge_insert; rollback verified [VERIFIED] |
| `delete(ref_doc_id)` | `DELETE FROM nodes WHERE document_id = ?` | |
| `get_nodes(filters)` | `SELECT id, document_id, node_content, embedding FROM nodes [WHERE ...]` | full scans on vec0 work [VERIFIED]; 45 ms / 20K rows |
| `query(VectorStoreQuery)` | `SELECT id, node_content, embedding, distance FROM nodes WHERE embedding MATCH ? AND k = ? [AND filters]` then Python-slice to `top_k` | `k = ?` is mandatory; `LIMIT` cannot be combined with `k` [VERIFIED]; results arrive distance-sorted [VERIFIED]; similarities = `1 - distance` |
| `_build_where(filters)` | same EQ/IN translation, but emitting `?` placeholders + params list | **Upgrade**: bound parameters replace today's manual `_escape()` string interpolation |
| `get_modified_times()` | `SELECT document_id, modified FROM nodes` + first-seen dedupe in Python | identical logic |
| `ensure_document_id_scalar_index()` | no-op (delete if nothing else needs it) | metadata filters are evaluated in the chunk scan; nothing to create |
| `maybe_create_ann_index()` | no-op on 0.1.9 | ANN (rescore/diskann) is 0.1.10-alpha territory; adopting an ANN index makes the file unreadable by 0.1.9 (one-way door), while flat tables round-trip 0.1.9 <-> 0.1.10a4 cleanly [VERIFIED]. Revisit post-0.1.10-final |
| `compact(retention_seconds)` | **rebuild-based compaction**, see below | replaces Lance MVCC cleanup |
Filter constraint surface (loud errors otherwise, [VERIFIED]): only `=, !=, <, <=, >, >=, IN` on metadata columns in KNN queries. We use only EQ/IN. Never use `NOT IN` (the vtab cannot see it; SQLite post-filters and silently under-delivers below k, asg017/sqlite-vec#116).
## Compaction: the one real behavioral difference
vec0 DELETE only flips a validity bit; space is never reclaimed, and VACUUM recovers only about half (asg017/sqlite-vec#54, #220, open; fix PRs #243/#210 unmerged). Measured: 5 delete+reinsert cycles on 2K rows grew the file 3.32 MB -> 6.56 MB; VACUUM got back to 4.94 MB. Paperless's per-document churn (every document edit is a delete+reinsert) hits this directly.
So `compact()` becomes the maintainer-endorsed rebuild (asg017/sqlite-vec#205):
```sql
CREATE VIRTUAL TABLE nodes_new USING vec0(...);
INSERT INTO nodes_new SELECT id, document_id, modified, node_content, embedding FROM nodes;
DROP TABLE nodes;
ALTER TABLE nodes_new RENAME TO nodes; -- then VACUUM
```
This copies vectors without re-embedding, runs under the existing write FileLock, and slots into the existing `document_llmindex compact` command and the scheduled maintenance task. A cheap trigger heuristic: rebuild when `count(*) in nodes_rowids shadow` (cumulative) exceeds ~2x live rows, or just keep the existing scheduled cadence.
## Concurrency
vec0 is a plain vtab over ordinary shadow tables, so standard SQLite WAL semantics apply, and the existing architecture is already the textbook arrangement: writers serialized by `settings.LLM_INDEX_LOCK` FileLock, readers concurrent via WAL. Verified across processes: a reader during another process's open write transaction does not block and sees a consistent pre-transaction snapshot; post-commit it sees the new rows [VERIFIED]. No sqlite-vec-specific multi-process corruption, locking, or segfault reports exist in the tracker. The 0.1.10a4 cached-statement fix (#295) is a Firefox/mozStorage `sqlite3_close()` issue; CPython's `sqlite3` is unaffected, no Python-side reports.
Same caveat as the main SQLite DB: `LLM_INDEX_DIR` should not be on NFS.
## Performance expectations (measured on the 0.1.9 no-SIMD wheel)
- KNN 20K rows x 1024 dims: ~74 ms plain, ~39 ms with a metadata EQ filter.
- 100K x 768: 185 ms/query (vs 497 ms for LanceDB exact search on identical data).
- Extrapolated 500K x 1024-1536: ~0.9-1.8 s/query; 384 dims roughly 4x faster. Acceptable for suggestions/chat at the extreme tail; typical installs (low tens of thousands of chunks) are tens of ms.
- Insert: ~3,300 rows/s at 1024 dims in a single transaction.
- File size: ~raw vector size (~4.3 KB/row at 1024 dims), no compression; plus the bloat behavior above.
## Migration from the Lance store
Beta policy: re-embed. On startup/first index task: if `LLM_INDEX_DIR` contains a Lance table but no `llmindex.db`, log and queue a full rebuild, then remove the Lance directory. No cross-store vector copy, no lancedb import anywhere in the path (which is what un-breaks #12970 hosts: they currently crash at import, have no usable index, and get a fresh build).
PR #12968's migration machinery maps onto `index_meta['schema_version']`: structural migrations = create-new-table + `INSERT ... SELECT` + rename (vectors copied, no re-embed; same shape as the compaction rebuild); re-embed migrations = drop + full rebuild, jumping straight to the current version.
## Dependency changes
- Add: `sqlite-vec==0.1.9` (one ~100 KB platform wheel, zero Python deps).
- Remove: `lancedb~=0.33.0` (and its pylance/lancedb wheels, ~40 MB). `pyarrow` leaves this module; check whether anything else in the AI stack still needs it before dropping from pyproject.
## Test plan notes
- pytest-style per project convention; the store tests can run against a tmp_path DB file (or `:memory:` for pure-logic tests; extension loading works on uv-managed CPython [VERIFIED]).
- Port the existing `test_vector_store.py` surface; add dedicated tests for: upsert transactionality (no transient empty state mid-upsert from a second connection), NULL-coercion in `_row()`, k-slice behavior, EQ/IN filter correctness, compaction rebuild preserving rows byte-for-byte, vec_debug canary logging.
- The qemu matrix (`/tmp/vstore-avx-test/`) can be re-run against any future sqlite-vec bump: `qemu-x86_64 -cpu Westmere venv/bin/python candidate_test.py sqlite_vec <dir>`.
## Benchmark harness
`src/bench_vector_store.py` -- standalone head-to-head comparison run during the migration window when both `PaperlessLanceVectorStore` and `PaperlessSqliteVecVectorStore` coexist (Task 3 Phase A of the implementation plan). After Phase B replaces `vector_store.py`, the Lance import fails gracefully and only the sqlite-vec half runs (useful for post-migration baseline checks).
```bash
cd src
uv run python bench_vector_store.py # auto-generates bench_data.pkl on first run
uv run python bench_vector_store.py --regenerate # force re-embed
```
**Phase 1 (data generation, skipped if `bench_data.pkl` exists):** Faker generates `--n-docs` (default 2000) fake documents -- title, body, correspondent, ISO timestamp. Each body is split into `--chunks-per-doc` (default 3) equal-length chunks (~6000 total nodes). A warm-up embed call fires before generation to ensure the model is resident in GPU. All chunk texts are embedded via Ollama `/api/embed` in batches of 32 and saved to `bench_data.pkl`. Faker seed 42 for reproducibility.
**Phase 2 (benchmark):** Each store runs in an isolated `tempfile.TemporaryDirectory()`. Query vectors are drawn reproducibly from the corpus (every 10th node, wrapping).
| Operation | Reps | Metric |
| ----------------------------------------- | ---- | --------------------- |
| `add()` bulk insert | 1 | total time |
| `query()` plain | 50 | p50 / p95 |
| `query()` filtered (IN on 20% of doc IDs) | 50 | p50 / p95 |
| `get_modified_times()` | 20 | p50 |
| `upsert_document()` | 50 | p50 / p95 |
| `compact()` | 1 | total time |
| File size | -- | pre- and post-compact |
**CLI flags:** `--n-docs` (2000), `--chunks-per-doc` (3), `--data-file` (`bench_data.pkl`), `--regenerate`, `--ollama-url` (`http://192.168.1.87:11434`), `--embed-model` (`qwen3-embedding:4b`), `--query-iters` (50).
**Dependencies:** `faker` and `httpx` must be available (`uv add --dev faker httpx` if not already installed).
## Risk register (from the 2026-06-10 issues audit)
| Risk | Ref | State | Disposition |
| ------------------------------------------- | --------------------------------------- | -------------- | ------------------------------------------------------------------------------------------------------------------------------------------------- |
| 0.1.10+ wheels bake AVX, no dispatch | release CI change, verified on 0.1.10a4 | current | Pin 0.1.9; vec_debug canary; upstream ask before any bump |
| DELETE never reclaims space; VACUUM ~50% | #54, #220 | open | Rebuild-based `compact()` above |
| INSERT OR REPLACE broken on vec0 | #259 | open | Use DELETE+INSERT in txn (design already does) |
| NULL metadata rejected | #141 | open | Sentinel `""` coercion (already current behavior) |
| Partition-key IN returns k per partition | #142 | open | Avoided: document_id is a metadata column |
| NOT IN silently under-delivers | #116 | open | Never emit NOT IN |
| Locale strtod breaks JSON vector parsing | #241 | open | Always BLOB-bind vectors |
| Single weekend maintainer; fix PRs languish | #226 | open | Mitigated by Mozilla sponsorship + Firefox vendoring (release-train consumer); pin + vendor-from-source remains the escape hatch (no sdist: #211) |
| ANN index = one-way file format | 0.1.10 alphas | — | Do not adopt ANN until 0.1.10 final + flag audit |
| Long-TEXT metadata DELETE bug | #274 | fixed in 0.1.9 | Floor requirement `>=0.1.9` already implied by pin |