From b47cc92b290978fa0418082bf74aa82b9bee8b8a Mon Sep 17 00:00:00 2001 From: Trenton Holmes <797416+stumpylog@users.noreply.github.com> Date: Mon, 15 Jun 2026 10:54:37 -0700 Subject: [PATCH] More done work --- .../2026-06-11-unicode-nfc-normalization.md | 0 .../2026-06-14-search-query-translation.md | 1142 +++++++++++++++++ ...6-06-14-search-query-translation-design.md | 407 ++++++ ...026-06-14-search-phase2-tracking-prompt.md | 41 + 4 files changed, 1590 insertions(+) rename docs/superpowers/{ => done}/plans/2026-06-11-unicode-nfc-normalization.md (100%) create mode 100644 docs/superpowers/done/plans/2026-06-14-search-query-translation.md create mode 100644 docs/superpowers/done/specs/2026-06-14-search-query-translation-design.md create mode 100644 docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md diff --git a/docs/superpowers/plans/2026-06-11-unicode-nfc-normalization.md b/docs/superpowers/done/plans/2026-06-11-unicode-nfc-normalization.md similarity index 100% rename from docs/superpowers/plans/2026-06-11-unicode-nfc-normalization.md rename to docs/superpowers/done/plans/2026-06-11-unicode-nfc-normalization.md diff --git a/docs/superpowers/done/plans/2026-06-14-search-query-translation.md b/docs/superpowers/done/plans/2026-06-14-search-query-translation.md new file mode 100644 index 000000000..846964801 --- /dev/null +++ b/docs/superpowers/done/plans/2026-06-14-search-query-translation.md @@ -0,0 +1,1142 @@ +# Whoosh→Tantivy Query Translation Layer — Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Replace the order-dependent regex stack in `search/_query.py` with a structural, context-aware translation layer that eliminates the date- and comma-class HTTP 400s while matching verified Whoosh v2 semantics. + +**Architecture:** A flat, depth-aware scanner tokenizes the raw query into `FieldValue` / `FieldValueList` / `FieldRange` / `Comma` / `Passthrough` tokens; a single `translate_date_value` shape-dispatch converts date tokens to RFC3339 Tantivy ranges; comma tokens resolve to AND / value-list / literal; everything else passes through to Tantivy untouched. Output is a string handed to `index.parse_query` (Phase 1). Pure date math lives in a new dependency-free `_dates.py`. + +**Tech Stack:** Python 3.11+, `regex` module, Tantivy 0.26.0 (`tantivy-py`), Django, pytest (`-m search` marker). + +**Reference spec:** `docs/superpowers/specs/2026-06-14-search-query-translation-design.md`. Empirical ground truth: `SEARCH_TANTIVY_WHOOSH_COMPAT.md`. + +**Conventions:** ruff (line length 88, double quotes, single-line isort). Run focused tests with `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" -v`. The `--override-ini` drops coverage/xdist for fast iteration. + +--- + +## File Structure + +- **Create `src/documents/search/_dates.py`** — pure date-boundary helpers (moved from `_query.py`) + new partial-date precision helpers. No Django/app imports. +- **Create `src/documents/search/_translate.py`** — token dataclasses, `scan()`, `_resolve_commas()`, `translate_date_value()`, `translate_query()`. +- **Modify `src/documents/search/_query.py`** — re-export helpers from `_dates`; make `rewrite_natural_date_keywords`/`normalize_query` delegate; route `parse_user_query` through `translate_query` with a safety net. +- **Create `src/documents/tests/search/test_translate.py`** — all new unit + golden + parse-acceptance tests. +- **Modify `src/documents/tests/search/test_api_search.py`** (or nearest existing search-API test module) — end-to-end 400→200 tests. Confirm the exact module name first (see Task 8). + +--- + +## Task 1: Extract date helpers into `_dates.py` (no behavior change) + +**Files:** + +- Create: `src/documents/search/_dates.py` +- Modify: `src/documents/search/_query.py` (remove the moved helpers, add re-export) +- Test: existing `src/documents/tests/search/test_query.py` must stay green. + +- [ ] **Step 1: Create `_dates.py` by moving the helpers verbatim** + +Move these symbols out of `_query.py` into a new `src/documents/search/_dates.py`, unchanged: the constants `_DATE_ONLY_FIELDS`, the eight `_*` keyword constants, `_DATE_KEYWORDS`, `_DATE_KEYWORD_PATTERN`, and the functions `_fmt`, `_iso_range`, `_date_only_range`, `_datetime_range`. Header of the new file: + +```python +from __future__ import annotations + +from datetime import UTC +from datetime import date +from datetime import datetime +from datetime import timedelta +from typing import TYPE_CHECKING +from typing import Final + +from dateutil.relativedelta import relativedelta + +if TYPE_CHECKING: + from datetime import tzinfo + +# ... (moved constants and functions verbatim) ... +``` + +- [ ] **Step 2: Re-export from `_query.py` for backward-compatible imports** + +At the top of `_query.py`, after the existing imports, add: + +```python +from documents.search._dates import _DATE_ONLY_FIELDS +from documents.search._dates import _date_only_range +from documents.search._dates import _datetime_range +from documents.search._dates import _fmt +from documents.search._dates import _iso_range +``` + +Keep `_query.py`'s remaining regex rewriters as-is for now (they still reference these names, now imported). Remove the now-duplicated definitions from `_query.py`. + +- [ ] **Step 3: Run the existing search tests to confirm no behavior change** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_query.py -v` +Expected: PASS (same set as before; `TestCreatedDateField`/`TestDateTimeFields` import `_date_only_range`/`_datetime_range` from `_query` and still resolve via re-export). + +- [ ] **Step 4: Commit** + +```bash +git add src/documents/search/_dates.py src/documents/search/_query.py +git commit -m "Refactor: extract search date helpers into _dates.py" +``` + +--- + +## Task 2: Partial-date precision helpers in `_dates.py` + +**Files:** + +- Modify: `src/documents/search/_dates.py` +- Test: `src/documents/tests/search/test_translate.py` (new) + +- [ ] **Step 1: Write the failing test** + +Create `src/documents/tests/search/test_translate.py`: + +```python +import pytest + +from documents.search._dates import _precision_bounds + + +@pytest.mark.search +class TestPrecisionBounds: + @pytest.mark.parametrize( + ("digits", "expected"), + [ + ("2020", ((2020, 1, 1), (2021, 1, 1))), + ("202003", ((2020, 3, 1), (2020, 4, 1))), + ("202012", ((2020, 12, 1), (2021, 1, 1))), + ("20200115", ((2020, 1, 15), (2020, 1, 16))), + ("20201231", ((2020, 12, 31), (2021, 1, 1))), + ], + ) + def test_valid(self, digits, expected): + lo, hi = _precision_bounds(digits) + assert (lo.year, lo.month, lo.day) == expected[0] + assert (hi.year, hi.month, hi.day) == expected[1] + + @pytest.mark.parametrize("digits", ["202023", "20200230", "20201301", "20", "abcd"]) + def test_invalid_returns_none(self, digits): + assert _precision_bounds(digits) is None +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py -v` +Expected: FAIL with `ImportError: cannot import name '_precision_bounds'`. + +- [ ] **Step 3: Implement `_precision_bounds` in `_dates.py`** + +```python +def _precision_bounds(digits: str) -> tuple[date, date] | None: + """ + Map a 4/6/8-digit date token to (start, exclusive_end) calendar dates. + + YYYY -> whole year, YYYYMM -> whole month, YYYYMMDD -> single day. + Returns None for any unparsable or out-of-range value (e.g. month 23), + so callers can emit a no-match clause instead of erroring (Whoosh parity). + """ + try: + if len(digits) == 4: + year = int(digits) + return date(year, 1, 1), date(year + 1, 1, 1) + if len(digits) == 6: + year, month = int(digits[:4]), int(digits[4:6]) + start = date(year, month, 1) + end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1) + return start, end + if len(digits) == 8: + start = date(int(digits[:4]), int(digits[4:6]), int(digits[6:8])) + return start, start + timedelta(days=1) + except ValueError: + return None + return None +``` + +- [ ] **Step 4: Add `_field_range_from_dates` helper (used by the translator)** + +```python +def _field_range_from_dates(field: str, start: date, end: date, tz: tzinfo) -> str: + """ + Build a Tantivy ``field:[lo TO hi]`` ISO range from calendar-date bounds. + + For DateField (``created``) the bounds are UTC midnight (no offset). For + DateTimeField (``added``/``modified``) the bounds are local-tz midnight + converted to UTC, matching how each field is indexed. + """ + if field in _DATE_ONLY_FIELDS: + lo = datetime(start.year, start.month, start.day, tzinfo=UTC) + hi = datetime(end.year, end.month, end.day, tzinfo=UTC) + else: + lo = datetime(start.year, start.month, start.day, tzinfo=tz).astimezone(UTC) + hi = datetime(end.year, end.month, end.day, tzinfo=tz).astimezone(UTC) + return f"{field}:{_iso_range(lo, hi)}" +``` + +- [ ] **Step 5: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py -v` +Expected: PASS. + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_dates.py src/documents/tests/search/test_translate.py +git commit -m "Feature: partial-date precision helpers for query translation" +``` + +--- + +## Task 3: Token model + `scan()` for fields, ranges, passthrough + +**Files:** + +- Create: `src/documents/search/_translate.py` +- Test: `src/documents/tests/search/test_translate.py` + +This task handles `FieldValue`, `FieldRange`, and `Passthrough` only. Commas and value-lists come in Task 4. + +- [ ] **Step 1: Write the failing test** + +Append to `test_translate.py`: + +```python +from documents.search._translate import FieldRange +from documents.search._translate import FieldValue +from documents.search._translate import Passthrough +from documents.search._translate import scan + + +@pytest.mark.search +class TestScan: + def test_plain_words_are_passthrough(self): + assert scan("bank statement") == [Passthrough("bank statement")] + + def test_field_value(self): + assert scan("created:2020") == [FieldValue("created", "2020")] + + def test_field_value_in_boolean(self): + toks = scan("created:2020 OR foo") + assert toks == [ + FieldValue("created", "2020"), + Passthrough(" OR foo"), + ] + + def test_field_value_in_parens(self): + toks = scan("(created:2020 OR foo)") + assert toks == [ + Passthrough("("), + FieldValue("created", "2020"), + Passthrough(" OR foo)"), + ] + + def test_quoted_value(self): + assert scan('correspondent:"A B"') == [FieldValue("correspondent", '"A B"')] + + def test_field_range(self): + assert scan("created:[2020 TO 2021]") == [ + FieldRange("created", "[", "2020", "2021", "]"), + ] + + def test_open_range(self): + assert scan("created:[2020 to]") == [ + FieldRange("created", "[", "2020", "", "]"), + ] + assert scan("created:[to 2020]") == [ + FieldRange("created", "[", "", "2020", "]"), + ] + + def test_comma_inside_range_not_split(self): + # No depth-0 comma here; the whole thing is one range token. + toks = scan("created:[2020 TO 2021]") + assert len(toks) == 1 +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestScan -v` +Expected: FAIL with `ModuleNotFoundError`/`ImportError`. + +- [ ] **Step 3: Implement the token model and scanner core** + +Create `src/documents/search/_translate.py`: + +```python +from __future__ import annotations + +from dataclasses import dataclass +from dataclasses import field as dc_field +from typing import TYPE_CHECKING + +import regex + +if TYPE_CHECKING: + from datetime import tzinfo + +# Fields that store exact, non-analyzed comma-joined tokens in the index and so +# need explicit comma->AND expansion (Whoosh KEYWORD(commas=True) set). +MULTI_VALUE_FIELDS = frozenset({"tag", "tag_id", "viewer_id"}) + +# Date fields whose values/ranges get rewritten to RFC3339 Tantivy ranges. +DATE_FIELDS = frozenset({"created", "modified", "added"}) + +# Known schema fields: a comma immediately followed by ``:`` is a clause +# separator. Restricting to known fields prevents URL-like ``http:`` misfires. +KNOWN_FIELDS = frozenset( + { + "title", + "content", + "correspondent", + "document_type", + "type", + "storage_path", + "tag", + "tag_id", + "correspondent_id", + "document_type_id", + "storage_path_id", + "owner_id", + "viewer_id", + "asn", + "page_count", + "num_notes", + "created", + "modified", + "added", + "original_filename", + "checksum", + "notes", + "custom_fields", + }, +) + +_FIELD_RE = regex.compile(r"(?P\w+):") + + +@dataclass(frozen=True) +class FieldValue: + field: str + value: str + + +@dataclass(frozen=True) +class FieldValueList: + field: str + values: tuple[str, ...] + + +@dataclass(frozen=True) +class FieldRange: + field: str + open: str + lo: str + hi: str + close: str + + +@dataclass(frozen=True) +class Comma: + pass + + +@dataclass(frozen=True) +class Passthrough: + raw: str + + +Token = "FieldValue | FieldValueList | FieldRange | Comma | Passthrough" + +_CLOSE = {"[": "]", "{": "}"} + + +def scan(query: str) -> list: + """ + Tokenize a raw query into date/comma-aware tokens, leaving everything else + as verbatim ``Passthrough`` runs. Depth-aware over ``[]``/``{}`` and quotes. + """ + tokens: list = [] + buf: list[str] = [] # accumulates passthrough chars + i, n = 0, len(query) + + def flush() -> None: + if buf: + tokens.append(Passthrough("".join(buf))) + buf.clear() + + while i < n: + ch = query[i] + # A field token can begin only at a word boundary outside any value. + m = _FIELD_RE.match(query, i) + if m and (i == 0 or not query[i - 1].isalnum() and query[i - 1] != "_"): + field = m.group("field") + j = m.end() + if j < n and query[j] in "[{": + rng = _consume_range(query, j, field) + if rng is not None: + token, i = rng + flush() + tokens.append(token) + continue + else: + val = _consume_value(query, j) + if val is not None: + value, i = val + flush() + tokens.append(FieldValue(field, value)) + continue + buf.append(ch) + i += 1 + + flush() + return tokens + + +def _consume_range(query: str, start: int, field: str): + """Consume ``[lo TO hi]`` / ``{lo TO hi}`` from ``start`` (the bracket).""" + open_br = query[start] + close_br = _CLOSE[open_br] + end = query.find(close_br, start + 1) + if end == -1: + return None + inner = query[start + 1 : end] + parts = regex.split(r"\s+TO\s+", inner, maxsplit=1, flags=regex.IGNORECASE) + if len(parts) == 2: + lo, hi = parts[0].strip(), parts[1].strip() + else: + lo, hi = inner.strip(), "" + return FieldRange(field, open_br, lo, hi, close_br), end + 1 + + +def _consume_value(query: str, start: int): + """Consume a bare or quoted field value from ``start``.""" + n = len(query) + if start >= n or query[start] in " \t": + return None + if query[start] in "\"'": + quote = query[start] + end = query.find(quote, start + 1) + if end == -1: + return None + return query[start : end + 1], end + 1 + j = start + while j < n and query[j] not in " \t)": + j += 1 + return query[start:j], j +``` + +> **Note for the implementer:** the comma-stops-value behavior is intentionally +> deferred to Task 4. In this task a value like `tag:foo,bar` will scan as a +> single `FieldValue("tag", "foo,bar")`; Task 4's tests pin the final behavior. + +- [ ] **Step 4: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestScan -v` +Expected: PASS. If a case fails, fix `scan()` to satisfy the test (the tests are the spec). + +- [ ] **Step 5: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_translate.py +git commit -m "Feature: query scanner for field/range/passthrough tokens" +``` + +--- + +## Task 4: Comma resolution — value-lists and clause separators + +**Files:** + +- Modify: `src/documents/search/_translate.py` +- Test: `src/documents/tests/search/test_translate.py` + +- [ ] **Step 1: Write the failing test** + +Append: + +```python +from documents.search._translate import Comma +from documents.search._translate import FieldValueList +from documents.search._translate import resolve_commas + + +@pytest.mark.search +class TestCommaResolution: + def test_value_list_multi_value_field(self): + toks = resolve_commas(scan("tag:foo,bar")) + assert toks == [FieldValueList("tag", ("foo", "bar"))] + + def test_value_list_three(self): + toks = resolve_commas(scan("tag_id:1,2,3")) + assert toks == [FieldValueList("tag_id", ("1", "2", "3"))] + + def test_text_field_comma_is_literal(self): + # correspondent is not multi-value: comma stays inside the value. + toks = resolve_commas(scan("correspondent:foo,bar")) + assert toks == [FieldValue("correspondent", "foo,bar")] + + def test_clause_separator_before_known_field(self): + toks = resolve_commas(scan("tag:foo,type:bar")) + assert toks == [FieldValue("tag", "foo"), Comma(), FieldValue("type", "bar")] + + def test_clause_separator_after_range(self): + toks = resolve_commas(scan("created:[2020 TO 2021],added:[2022 TO 2023]")) + assert toks == [ + FieldRange("created", "[", "2020", "2021", "]"), + Comma(), + FieldRange("added", "[", "2022", "2023", "]"), + ] +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestCommaResolution -v` +Expected: FAIL (`resolve_commas` undefined, and current `scan` swallows commas into values). + +- [ ] **Step 3: Make `_consume_value` comma-aware and add `resolve_commas`** + +Update `_consume_value` so a value stops at a depth-0 comma **only** when that comma is a clause separator or a value-list separator; otherwise the comma stays in the value. Implement this by splitting value consumption in two stages: first consume up to whitespace/`)`/comma; then, in `scan`, when the next char is `,`, decide using the field and what follows. + +Replace the field-handling branch in `scan()` and add the resolver: + +```python +def _consume_value(query: str, start: int): + """Consume a value, stopping at whitespace, ``)``, or a comma.""" + n = len(query) + if start >= n or query[start] in " \t": + return None + if query[start] in "\"'": + quote = query[start] + end = query.find(quote, start + 1) + if end == -1: + return None + return query[start : end + 1], end + 1 + j = start + while j < n and query[j] not in " \t),": + j += 1 + return query[start:j], j + + +def _looks_like_known_field(query: str, pos: int) -> bool: + """True if a known ``field:`` token starts at ``pos``.""" + m = _FIELD_RE.match(query, pos) + return bool(m and m.group("field") in KNOWN_FIELDS) + + +def resolve_commas(tokens: list) -> list: + """ + Collapse value-list commas into ``FieldValueList`` and keep clause-separator + commas as ``Comma``. (Clause-sep commas are already emitted by ``scan`` via + the value-stop logic; this pass folds value-lists.) + """ + out: list = [] + for tok in tokens: + if ( + isinstance(tok, FieldValue) + and tok.field in MULTI_VALUE_FIELDS + and "," in tok.value + ): + values = tuple(v for v in tok.value.split(",") if v) + out.append(FieldValueList(tok.field, values)) + else: + out.append(tok) + return out +``` + +Now update `scan()`'s comma handling: after consuming a `FieldValue`, if the next char is `,`, decide: + +- if `field in MULTI_VALUE_FIELDS` and the char after the comma is a bare term (no `:` field, not a bracket): keep consuming `,term` into the value (value-list, folded later); +- elif the comma is a clause separator (followed by a known `field:`): emit the `FieldValue`, then a `Comma` token, and continue after the comma; +- else: the comma is literal — append it to the value and keep going. + +Implement with a post-value loop in `scan` (replace the `FieldValue` emit branch): + +```python + val = _consume_value(query, j) + if val is not None: + value, k = val + # Handle trailing comma semantics. + while k < n and query[k] == ",": + nxt = k + 1 + if field in MULTI_VALUE_FIELDS and not _looks_like_known_field( + query, nxt + ) and (nxt >= n or query[nxt] not in "[{ \t),"): + more = _consume_value(query, nxt) + if more is None: + break + value = f"{value},{more[0]}" + k = more[1] + continue + break + flush() + tokens.append(FieldValue(field, value)) + i = k + if i < n and query[i] == "," and _looks_like_known_field( + query, i + 1 + ): + tokens.append(Comma()) + i += 1 + continue +``` + +Also handle the clause-separator comma that follows a `FieldRange` or quoted value. After emitting a `FieldRange` (in the range branch) or a quoted `FieldValue`, add the same post-emit check: if `query[i] == ","` and `_looks_like_known_field(query, i+1)`, emit `Comma()` and advance. Factor this into a helper: + +```python +def _maybe_comma(query: str, i: int, tokens: list) -> int: + if i < len(query) and query[i] == "," and _looks_like_known_field(query, i + 1): + tokens.append(Comma()) + return i + 1 + return i +``` + +Call `i = _maybe_comma(query, i, tokens)` right after appending a `FieldRange` and after appending a quoted `FieldValue`. + +> **Implementer note:** the comma rules are subtle. Treat the Task-4 tests as +> the contract and iterate `scan`/`resolve_commas` until all pass. Add a test +> for `correspondent:"A B",created:[2020 TO 2021]` (clause-sep after a quote) +> and `http://example.com/a,b` (must remain a single `Passthrough`, since +> `http` is not a known field — the comma stays literal). + +- [ ] **Step 4: Add the two extra cases to the test** + +```python + def test_clause_separator_after_quote(self): + toks = resolve_commas(scan('correspondent:"A B",created:[2020 TO 2021]')) + assert toks == [ + FieldValue("correspondent", '"A B"'), + Comma(), + FieldRange("created", "[", "2020", "2021", "]"), + ] + + def test_url_comma_is_literal_passthrough(self): + toks = resolve_commas(scan("http://example.com/a,b")) + assert toks == [Passthrough("http://example.com/a,b")] +``` + +- [ ] **Step 5: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestCommaResolution -v` +Expected: PASS. + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_translate.py +git commit -m "Feature: comma resolution (value-list vs clause separator)" +``` + +--- + +## Task 5: `translate_date_value` — scalar values + +**Files:** + +- Modify: `src/documents/search/_translate.py` +- Test: `src/documents/tests/search/test_translate.py` + +- [ ] **Step 1: Write the failing test (verified ground-truth rows)** + +```python +from zoneinfo import ZoneInfo + +from documents.search._translate import NO_MATCH +from documents.search._translate import translate_scalar + +UTC_TZ = ZoneInfo("UTC") + + +@pytest.mark.search +class TestTranslateScalar: + @pytest.mark.parametrize( + ("field", "value", "expected"), + [ + ("created", "2020", "created:[2020-01-01T00:00:00Z TO 2021-01-01T00:00:00Z]"), + ("created", "202003", "created:[2020-03-01T00:00:00Z TO 2020-04-01T00:00:00Z]"), + ("created", "20200115", "created:[2020-01-15T00:00:00Z TO 2020-01-16T00:00:00Z]"), + ("created", "2020-01-15", "created:[2020-01-15T00:00:00Z TO 2020-01-16T00:00:00Z]"), + ("created", "2020-03", "created:[2020-03-01T00:00:00Z TO 2020-04-01T00:00:00Z]"), + ], + ) + def test_partial_and_iso_dates(self, field, value, expected): + assert translate_scalar(field, value, UTC_TZ) == expected + + def test_invalid_date_is_no_match(self): + assert translate_scalar("created", "202023", UTC_TZ) == NO_MATCH + + def test_keyword_delegates(self): + # keyword path produces a range; just assert it is a created range + out = translate_scalar("created", "today", UTC_TZ) + assert out.startswith("created:[") and out.endswith("]") +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateScalar -v` +Expected: FAIL (`translate_scalar`, `NO_MATCH` undefined). + +- [ ] **Step 3: Implement `translate_scalar` and `NO_MATCH`** + +Add to `_translate.py`: + +```python +from documents.search._dates import _DATE_KEYWORDS +from documents.search._dates import _date_only_range +from documents.search._dates import _datetime_range +from documents.search._dates import _field_range_from_dates +from documents.search._dates import _precision_bounds + +# A valid Tantivy clause that parses but matches nothing (degenerate range on a +# date field). Used for unparsable dates, matching Whoosh's NullQuery. +NO_MATCH = "created:[9999-12-31T23:59:59Z TO 9999-12-31T23:59:59Z]" + +_DIGITS_RE = regex.compile(r"^\d{4}(?:\d{2}){0,2}$") +_ISO_RE = regex.compile(r"^\d{4}(?:-\d{2}(?:-\d{2})?)?$") + + +def translate_scalar(field: str, value: str, tz: tzinfo) -> str: + """Translate a bare date-field value to a Tantivy range string.""" + bare = value.strip("\"'").lower() + if bare in _DATE_KEYWORDS: + if field in DATE_FIELDS and field == "created": + return f"{field}:{_date_only_range(bare, tz)}" + return f"{field}:{_datetime_range(bare, tz)}" + digits = value.replace("-", "") + if _DIGITS_RE.match(value) or _ISO_RE.match(value): + bounds = _precision_bounds(digits) + if bounds is None: + return NO_MATCH + return _field_range_from_dates(field, bounds[0], bounds[1], tz) + # Unrecognized shape -> no-match rather than a Tantivy 400 (Whoosh parity). + return NO_MATCH +``` + +> **Implementer note:** the `created` vs `added`/`modified` keyword split reuses +> the existing `_date_only_range`/`_datetime_range`. Use `_DATE_ONLY_FIELDS` +> from `_dates` instead of the literal `== "created"` if you prefer; behavior is +> identical for the current field set. + +- [ ] **Step 4: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateScalar -v` +Expected: PASS. + +- [ ] **Step 5: Verify the NO_MATCH string actually parses in Tantivy** + +Run: + +```bash +cd src && PAPERLESS_SECRET_KEY=ci uv run python -c " +import django, os, tempfile +os.environ.setdefault('DJANGO_SETTINGS_MODULE','paperless.settings'); django.setup() +import tantivy +from documents.search._schema import build_schema +from documents.search._tokenizer import register_tokenizers +from documents.search._translate import NO_MATCH +from documents.search._query import DEFAULT_SEARCH_FIELDS, _FIELD_BOOSTS +idx = tantivy.Index(build_schema(), path=tempfile.mkdtemp()); register_tokenizers(idx,'english') +idx.parse_query(NO_MATCH, DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS) +print('NO_MATCH parses OK') +" +``` + +Expected: `NO_MATCH parses OK`. If it raises, change `NO_MATCH` to a form that parses (e.g. an equal-bound inclusive range is valid; if not, use `[9999-12-31T23:59:59Z TO 9999-12-31T23:59:58Z]` only if Tantivy tolerates reversed — otherwise keep equal bounds). + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_translate.py +git commit -m "Feature: scalar date-value translation with no-match fallback" +``` + +--- + +## Task 6: `translate_date_value` — ranges (partial / now / open / reversed) + +**Files:** + +- Modify: `src/documents/search/_translate.py` +- Test: `src/documents/tests/search/test_translate.py` + +- [ ] **Step 1: Write the failing test** + +```python +from documents.search._translate import OPEN_HI +from documents.search._translate import OPEN_LO +from documents.search._translate import translate_range + + +@pytest.mark.search +class TestTranslateRange: + @pytest.mark.parametrize( + ("lo", "hi", "expected"), + [ + ("2005", "2009", "created:[2005-01-01T00:00:00Z TO 2010-01-01T00:00:00Z]"), + ("202001", "202006", "created:[2020-01-01T00:00:00Z TO 2020-07-01T00:00:00Z]"), + ("20200101", "20201231", "created:[2020-01-01T00:00:00Z TO 2021-01-01T00:00:00Z]"), + ("2020-01-01", "2020-12-31", "created:[2020-01-01T00:00:00Z TO 2021-01-01T00:00:00Z]"), + ], + ) + def test_absolute_ranges(self, lo, hi, expected): + assert translate_range("created", lo, hi, UTC_TZ) == expected + + def test_reversed_swaps(self): + assert translate_range("created", "2009", "2005", UTC_TZ) == ( + "created:[2005-01-01T00:00:00Z TO 2010-01-01T00:00:00Z]" + ) + + def test_open_upper(self): + out = translate_range("created", "2020", "", UTC_TZ) + assert out == f"created:[2020-01-01T00:00:00Z TO {OPEN_HI}]" + + def test_open_lower(self): + out = translate_range("created", "", "2020", UTC_TZ) + assert out == f"created:[{OPEN_LO} TO 2021-01-01T00:00:00Z]" + + def test_invalid_bound_is_no_match(self): + assert translate_range("created", "202023", "2025", UTC_TZ) == NO_MATCH +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateRange -v` +Expected: FAIL. + +- [ ] **Step 3: Implement `translate_range`, `OPEN_LO`, `OPEN_HI`** + +```python +from datetime import UTC +from datetime import datetime + +from documents.search._dates import _DATE_ONLY_FIELDS +from documents.search._dates import _fmt + +OPEN_LO = "0001-01-01T00:00:00Z" +OPEN_HI = "9999-12-31T23:59:59Z" + + +def _bound_datetimes(field: str, token: str, tz: tzinfo): + """ + Return (floor_dt, ceil_dt) UTC datetimes for a single range bound token, or + None if the token is unparsable. ``now`` resolves to the current instant. + """ + token = token.strip() + if token.lower() == "now": + now = datetime.now(UTC) + return now, now + digits = token.replace("-", "") + bounds = _precision_bounds(digits) + if bounds is None: + return None + start, end = bounds + if field in _DATE_ONLY_FIELDS: + lo = datetime(start.year, start.month, start.day, tzinfo=UTC) + hi = datetime(end.year, end.month, end.day, tzinfo=UTC) + else: + lo = datetime(start.year, start.month, start.day, tzinfo=tz).astimezone(UTC) + hi = datetime(end.year, end.month, end.day, tzinfo=tz).astimezone(UTC) + return lo, hi + + +def translate_range(field: str, lo: str, hi: str, tz: tzinfo) -> str: + """Translate a date-field ``[lo TO hi]`` range to a Tantivy ISO range.""" + lo_s = lo.strip() + hi_s = hi.strip() + + lo_iso = OPEN_LO + hi_iso = OPEN_HI + + if lo_s: + b = _bound_datetimes(field, lo_s, tz) + if b is None: + return NO_MATCH + lo_iso = _fmt(b[0]) # floor + if hi_s: + b = _bound_datetimes(field, hi_s, tz) + if b is None: + return NO_MATCH + hi_iso = _fmt(b[1]) # ceil (exclusive end / start of next period) + + # Reversed explicit range: swap so it spans the intended window. + if lo_s and hi_s and lo_iso > hi_iso: + lo_iso, hi_iso = hi_iso, lo_iso + return f"{field}:[{lo_iso} TO {hi_iso}]" +``` + +> **Note:** for `now` bounds, floor==ceil==the instant, so `[20200101 TO now]` +> uses `floor(20200101)`..`now`. Reversed-swap uses ISO string comparison, which +> is correct for RFC3339 `...Z` values (lexicographic == chronological). + +- [ ] **Step 4: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateRange -v` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_translate.py +git commit -m "Feature: date range translation incl. open-ended and reversed" +``` + +--- + +## Task 7: `translate_query` — full pipeline + parse-acceptance + +**Files:** + +- Modify: `src/documents/search/_translate.py` +- Test: `src/documents/tests/search/test_translate.py` + +- [ ] **Step 1: Write the failing test (golden cases + parse-acceptance)** + +```python +import tempfile + +import tantivy + +from documents.search._query import DEFAULT_SEARCH_FIELDS +from documents.search._query import _FIELD_BOOSTS +from documents.search._schema import build_schema +from documents.search._tokenizer import register_tokenizers +from documents.search._translate import translate_query + + +@pytest.fixture(scope="module") +def index(): + idx = tantivy.Index(build_schema(), path=tempfile.mkdtemp()) + register_tokenizers(idx, "english") + return idx + + +@pytest.mark.search +class TestTranslateQuery: + @pytest.mark.parametrize( + ("raw", "expected"), + [ + ("created:2020", "created:[2020-01-01T00:00:00Z TO 2021-01-01T00:00:00Z]"), + ("tag:foo,bar", "tag:foo AND tag:bar"), + ("tag:foo,type:bar", "tag:foo AND type:bar"), + ( + "created:[2020 TO 2021],added:[2022 TO 2023]", + "created:[2020-01-01T00:00:00Z TO 2022-01-01T00:00:00Z] AND " + "added:[2022-01-01T00:00:00Z TO 2024-01-01T00:00:00Z]", + ), + ("correspondent:foo,bar", "correspondent:foo,bar"), + ], + ) + def test_golden(self, raw, expected): + assert translate_query(raw, UTC_TZ) == expected + + @pytest.mark.parametrize( + "raw", + [ + "created:2020", + "created:202003", + "created:[20200101 TO 20201231]", + "created:[2020-01-01 TO 2020-12-31]", + "created:[2020 to]", + "created:[to 2020]", + "title:x,created:[2020 TO 2021]", + "created:2020 OR foo", + "(created:2020 OR invoice)", + "tag:foo,type:bar", + "created:202023", + "bank statement", + ], + ) + def test_parse_acceptance(self, index, raw): + translated = translate_query(raw, UTC_TZ) + # Must not raise: + index.parse_query(translated, DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS) +``` + +- [ ] **Step 2: Run to verify it fails** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateQuery -v` +Expected: FAIL (`translate_query` not yet assembled / recombination missing). + +- [ ] **Step 3: Implement `translate_query` (recombination)** + +```python +def _render(tokens: list, tz: tzinfo) -> str: + parts: list[str] = [] + for tok in tokens: + if isinstance(tok, Passthrough): + parts.append(tok.raw) + elif isinstance(tok, Comma): + parts.append(" AND ") + elif isinstance(tok, FieldValueList): + parts.append(" AND ".join(f"{tok.field}:{v}" for v in tok.values)) + elif isinstance(tok, FieldValue): + if tok.field in DATE_FIELDS: + parts.append(translate_scalar(tok.field, tok.value, tz)) + else: + parts.append(f"{tok.field}:{tok.value}") + elif isinstance(tok, FieldRange): + if tok.field in DATE_FIELDS: + parts.append(translate_range(tok.field, tok.lo, tok.hi, tz)) + else: + parts.append(f"{tok.field}:{tok.open}{tok.lo} TO {tok.hi}{tok.close}") + return "".join(parts) + + +def translate_query(raw: str, tz: tzinfo) -> str: + """Translate a raw Whoosh-style query into Tantivy-compatible syntax.""" + tokens = resolve_commas(scan(raw)) + return _render(tokens, tz) +``` + +- [ ] **Step 4: Run to verify pass** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_translate.py::TestTranslateQuery -v` +Expected: PASS. Iterate `scan`/`resolve_commas`/`_render` until all golden + parse-acceptance cases pass. + +- [ ] **Step 5: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_translate.py +git commit -m "Feature: assemble translate_query pipeline with parse-acceptance tests" +``` + +--- + +## Task 8: Wire into `parse_user_query`, delegate old functions, end-to-end test + +**Files:** + +- Modify: `src/documents/search/_query.py` +- Test: existing `test_query.py`; new end-to-end test in the search-API test module. + +- [ ] **Step 1: Route `parse_user_query` through `translate_query` with a safety net** + +In `_query.py`, import and use the new pipeline: + +```python +from documents.search._translate import translate_query +``` + +Replace the first two lines of `parse_user_query` body: + +```python + try: + query_str = translate_query(raw_query, tz) + except Exception: # pragma: no cover - defensive + logger.warning("Query translation failed; using raw query", exc_info=True) + query_str = raw_query +``` + +(Ensure a module-level `logger = logging.getLogger("paperless.search")` exists; add it if missing.) + +- [ ] **Step 2: Delegate the old functions** + +Replace the bodies of `rewrite_natural_date_keywords` and `normalize_query` so they route through the new pipeline while preserving their existing tested outputs. Because the new pipeline does both date-rewriting and normalization in one pass, make `normalize_query` the full pipeline and `rewrite_natural_date_keywords` a no-op-preserving shim: + +```python +def rewrite_natural_date_keywords(query: str, tz: tzinfo) -> str: + # Back-compat shim: date rewriting now happens inside translate_query. + # Kept (with its tests) until callers/tests migrate to _translate. + from documents.search._translate import translate_query + + return translate_query(query, tz) + + +def normalize_query(query: str) -> str: + # Back-compat shim: comma/operator normalization now happens in translate_query. + from zoneinfo import ZoneInfo + + from documents.search._translate import translate_query + + return translate_query(query, ZoneInfo("UTC")) +``` + +- [ ] **Step 3: Run the existing search-query tests; update only where the new canonical output differs** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/search/test_query.py -v` +Expected: Most PASS. For any assertion that now differs because the new pipeline produces an equivalent-but-not-byte-identical string, update that assertion to the new canonical output (the spec authorized this as part of "delegate + plan removal"). Record each change in the commit message. + +> If a large number of `test_query.py` assertions break, prefer marking the +> superseded string-output tests for removal (they are scheduled for deletion) +> rather than rewriting many — but keep at least the behavioral coverage that is +> not duplicated in `test_translate.py`. + +- [ ] **Step 4: Add the end-to-end 400→200 test** + +First find the search-API test module: `rg -l "search" src/documents/tests/test_api*.py`. Add (adjust module/class to the existing pattern): + +```python +@pytest.mark.search +@pytest.mark.parametrize( + "query", + [ + "created:2020", + "created:[20200101 TO 20201231]", + "title:x,created:[2020 TO 2021]", + ], +) +def test_advanced_search_no_longer_400s(self, query): + response = self.client.get(f"/api/documents/?query={query}") + assert response.status_code == 200 +``` + +- [ ] **Step 5: Run the end-to-end test** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest --override-ini="addopts=" documents/tests/test_api_search.py -k no_longer_400 -v` (use the real module path). +Expected: PASS (200, not 400). + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_query.py src/documents/tests/ +git commit -m "Feature: route advanced search through structural translation layer" +``` + +--- + +## Task 9: Full search suite, lint, and design-doc status update + +**Files:** + +- Modify: `docs/superpowers/specs/2026-06-14-search-query-translation-design.md` (status line) + +- [ ] **Step 1: Run the full search marker suite** + +Run: `cd src && PAPERLESS_SECRET_KEY=ci uv run pytest -m search` +Expected: PASS (full search suite, with coverage/xdist as configured). + +- [ ] **Step 2: Lint and format** + +Run: `uv run ruff check src/documents/search/ src/documents/tests/search/ && uv run ruff format src/documents/search/ src/documents/tests/search/` +Expected: no errors; formatting clean. + +- [ ] **Step 3: Run pre-commit hooks on changed files** + +Run: `uv run prek run --files src/documents/search/_translate.py src/documents/search/_dates.py src/documents/search/_query.py` +Expected: PASS. + +- [ ] **Step 4: Update the spec status line** + +Change the spec's `**Status:**` line to `Phase 1 implemented`. Add a one-line pointer to this plan. + +- [ ] **Step 5: Commit** + +```bash +git add docs/superpowers/specs/2026-06-14-search-query-translation-design.md +git commit -m "Docs: mark query-translation Phase 1 implemented" +``` + +--- + +## Deferred (not in this plan) + +- **Phase 2 (query objects):** build `tantivy.Query` objects for date clauses (`range_query(None)` open bounds, `empty_query()` no-match). Gated on the upstream tantivy-py date-value contribution documented in the spec §9. Separate plan. +- **Bare relative scalar** (`created:-1week`): currently falls to `NO_MATCH` (no-400). If desired later, add a relative-shape branch to `translate_scalar`. +- **Removal of `rewrite_natural_date_keywords` / `normalize_query`** and their string-output tests, once `test_translate.py` fully covers them. +- **Unknown-field 400** (`http:`), out of scope per spec §10. + +--- + +## Self-Review notes (addressed) + +- **Spec coverage:** scanner (Task 3-4), date scalar (Task 5), date range incl. open/reversed (Task 6), comma rules incl. multi-value scoping (Task 4), safety net + delegation (Task 8), tests incl. parse-acceptance + end-to-end (Tasks 7-8), `_dates` extraction (Task 1). Phase 2 + unknown-field explicitly deferred. +- **No placeholders:** every code step shows complete code; the two `> Implementer note` blocks flag where the scanner must be iterated against its own tests, not skipped. +- **Type consistency:** token names (`FieldValue`, `FieldValueList`, `FieldRange`, `Comma`, `Passthrough`), functions (`scan`, `resolve_commas`, `translate_scalar`, `translate_range`, `translate_query`, `_precision_bounds`, `_field_range_from_dates`, `_bound_datetimes`), and constants (`NO_MATCH`, `OPEN_LO`, `OPEN_HI`, `MULTI_VALUE_FIELDS`, `DATE_FIELDS`, `KNOWN_FIELDS`) are used consistently across tasks. diff --git a/docs/superpowers/done/specs/2026-06-14-search-query-translation-design.md b/docs/superpowers/done/specs/2026-06-14-search-query-translation-design.md new file mode 100644 index 000000000..5e63bf3fb --- /dev/null +++ b/docs/superpowers/done/specs/2026-06-14-search-query-translation-design.md @@ -0,0 +1,407 @@ +# Design: Whoosh→Tantivy Advanced-Query Translation Layer + +**Date:** 2026-06-14 +**Status:** Phase 1 implemented on branch `fix/search-query-translation` (string-pipeline translation layer in `_translate.py`/`_dates.py`, wired into `parse_user_query`). Phase 2 (Query objects) remains gated on the tantivy-py release noted in §8/§9. Plan: `docs/superpowers/plans/2026-06-14-search-query-translation.md`. +**Branch context:** `beta`. Search code: `src/documents/search/`. +**Related:** `SEARCH_TANTIVY_WHOOSH_COMPAT.md` (repo root) — full empirical gap matrix and reproduction harnesses. Open branch `fix/scope-comma-expansion` (commit `d8fa97232`) — partial comma fix this design subsumes. + +--- + +## 1. Problem + +Paperless migrated full-text search from Whoosh (v2) to Tantivy (v3, commit `aed9abe48`, #12471). A +compatibility layer in `_query.py` rewrites old Whoosh query syntax into Tantivy syntax via a stack of +ordered regex substitutions before calling `tantivy.Index.parse_query`. + +That regex stack is piecemeal and has hit its complexity ceiling: + +- **No structural awareness.** It runs regex on a flat string, so it cannot distinguish a comma inside + `[...]` from a top-level clause separator, or know whether a `:` is a field prefix or text. This causes + real bugs (e.g. `title:x,created:[2020 TO 2021]` rewrites to malformed `title:x AND title:created:[...]`). +- **Order-dependence.** Six rewriters with implicit ordering contracts (14-digit before 8-digit, year-range + before 8-digit, etc.). Each new date form means reasoning about all interactions again. + +The result is a class of v2-valid queries that now return **HTTP 400**. There is no fallback: any syntax +Tantivy rejects raises out of `parse_query`, propagates through `_backend.py` (no try/except), and is caught +by the generic handler in `views.py:2471-2475` → `HttpResponseBadRequest`, with the real error only in logs. + +### Confirmed regressions (empirically reproduced; full table in `SEARCH_TANTIVY_WHOOSH_COMPAT.md` §5) + +| Class | Example | Today | Whoosh v2 | +| ------------------------ | -------------------------------------------------------------- | ---------------------- | --------------------------- | +| Bare date on date field | `created:2020`, `created:202003` | 400 | full-year / full-month span | +| Bracketed absolute range | `created:[20200101 TO 20201231]`, `[2020-01-01 TO 2020-12-31]` | 400 | floor/ceil range | +| Open-ended range | `created:[2020 to]`, `created:[to 2020]` | 400 | `>=` / `<=` range | +| Comma between clauses | `title:x,created:[...]` | 400 (malformed) | AND, both sides | +| Comma value-list scope | `tag:foo,type:bar` | wrong (`tag:type:bar`) | `tag:foo AND type:bar` | +| Invalid date | `created:202023` | 400 | NullQuery (no-match) | + +--- + +## 2. Goals / Non-goals + +**Goals** + +- Eliminate the date- and comma-class 400s by translating those forms to valid Tantivy syntax. +- Replace the order-dependent regex stack with a structural, context-aware pass. +- Match empirically-verified Whoosh v2 semantics (see §3). +- Additive tests: existing suite stays green during transition. +- **Field-name aliasing for the four renamed Whoosh→Tantivy fields** (added to scope 2026-06-14): + `type`→`document_type`, `type_id`→`document_type_id`, `path`→`storage_path`, `path_id`→`storage_path_id`. + These are the only fields the Tantivy migration renamed; v2 queries using the old names currently 400. + Both old and new spellings work after aliasing (new names pass through verbatim). The alias targets are the + text "name" fields (`document_type` is populated from `document_type.name`), so `type:invoice` → + `document_type:invoice` is correct. Fields with no Tantivy equivalent (`owner`, the `has_*` booleans, + `is_shared`, `custom_field_count`, `custom_fields_id`) are NOT aliased and remain out of scope. + +**Non-goals (explicitly out of scope)** + +- Full Whoosh query-language parity. +- Other Whoosh divergences: unknown-field-degrades-to-text (`http://x/a,b` → 400 on the `http:` unknown + field), tolerant unbalanced parens, case-insensitive `AND/OR/NOT`. These pass through to Tantivy unchanged + and are recorded as separate, known gaps (§10). +- `>`/`<`/`>=`/`<=` comparison operators — never supported in paperless-Whoosh (no `GtLtPlugin`); adding them + would be a new feature, not a compat fix. + +--- + +## 3. Empirical ground truth (verified, not inferred) + +Both engines were run directly; do not regress these without re-checking. + +**Whoosh v2** (paperless's exact `MultifieldParser([...]) + DateParserPlugin(basedate=...)` setup): + +- `created:2020` → `DateRange(2020-01-01 .. 2020-12-31 23:59:59)`; `created:202003` → March 2020. +- `created:202023` (month 23) → `<_NullQuery>` — **invalid dates match nothing, never error.** +- `created:[202001 TO 202006]` → floor/ceil partial-date bounds; `[2020 to]` / `[to 2020]` → open bounds. +- `created:-1week` → an exact-microsecond `Term` — parsed but matches ~nothing (useless in v2). +- Comma = AND between clauses, both preserved: `created:[r],added:[r]`, `correspondent:acme,created:[...]`, + `invoice,created:2020`. +- Comma value-list **only** for `KEYWORD(commas=True)` fields (`tag`, `tag_id`, `viewer_id`): + `tag:a,b` → `tag:a AND tag:b`. Text-field commas (`correspondent:foo,bar`, `title:10,20`) are split by the + field **analyzer** at parse time, not the comma plugin. +- `title:x,created:[...]` → only the DateRange (Whoosh drops `title:x`) — a v2 free-mode **bug**; the correct + target keeps both sides. + +**Tantivy 0.26.0** (`tantivy v0.26.0, index_format v7`): + +- Date fields require RFC3339 (`...Z`) literals; rejects bare `2020`, `20200101`, `2020-01-01`, lowercase + open ranges. +- Text-field commas parse fine verbatim (`correspondent:foo,bar`, `title:10,20`, `content:a,b,c`). +- Boolean/paren/phrase structure parses correctly, so a translated date token can sit anywhere: + `created:[...Z TO ...Z] OR foo` and `(created:[...] OR foo)` both parse. +- String date sentinels `0001-01-01T00:00:00Z` and `9999-12-31T23:59:59Z` both parse on a date field. + +--- + +## 4. Architecture (Approach 1: flat tokenizing scanner + single date translator) + +The scanner specializes only the date/comma tokens and treats everything else (operators, parens, phrases, +words, wildcards) as opaque passthrough. Tantivy keeps doing boolean/grouping/phrase parsing. A `field:value` +span is locally recognizable regardless of surrounding boolean context, so the scanner needs no understanding +of `AND/OR/NOT`. + +### 4.1 Module layout + +New module `src/documents/search/_translate.py` — single source of truth: + +``` +translate_query(raw: str, tz) -> str # top-level: scan → transform → recombine + scan(raw) -> list[Token] # depth-aware char-walk tokenizer + _resolve_commas(tokens) -> list[Token] # comma → AND / value-list / literal + translate_date_value(field, raw, tz) -> str # shape-dispatch date translator +``` + +Date-boundary math (`_date_only_range`, `_datetime_range`, floor/ceil helpers) **moves** from `_query.py` +into `_translate.py` (or a small shared `_dates.py`) so there is one home. The existing math is reused +verbatim — not rewritten. + +### 4.2 Data flow + +``` +parse_user_query(raw, tz) + → translate_query(raw, tz) # NEW pipeline + → index.parse_query(translated, DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS) +``` + +### 4.3 Transition (delegate + planned removal) + +- `rewrite_natural_date_keywords` and `normalize_query` become thin delegators to `translate_query` (or its + sub-steps) so their existing assertions still pass. +- The plan **explicitly schedules deleting both functions and their string-output tests** once + `test_translate.py` covers them. Single source of truth, no lingering dead code. + +### 4.4 Safety net + +`parse_user_query` wraps `translate_query` in try/except. On any unexpected scanner error it falls back to the +**raw** query string (today's behavior) and logs a warning. The new layer can never regress below current +behavior; worst case equals the status quo. + +--- + +## 5. Scanner token model + +`scan()` is a single left-to-right char walk tracking **quote state** and **`[]`/`{}` bracket depth**. Token +kinds: + +- **`FieldValue(field, value)`** — `field:value`, value a single bare token (no brackets). Recognized when, + outside quotes/brackets, it sees `\w+:` followed by a non-bracket value. Value runs until whitespace, a + resolved clause-comma, `)`, or end (may itself be quoted: `correspondent:"A B"`). +- **`FieldValueList(field, [v1, v2, …])`** — value-list, **only** for `field ∈ {tag, tag_id, viewer_id}`. A + `FieldValue` whose value is immediately followed by `,term` runs with **no spaces and no colon** in the + continuation terms. The no-colon rule fixes `tag:foo,type:bar` (the `type:bar` is not swallowed). +- **`FieldRange(field, open, lo, hi, close)`** — `field:[lo TO hi]` / `{…}`. Split on case-insensitive + `TO`; `lo`/`hi` may be empty (open). Consumed to the matching close bracket. +- **`Comma`** — emitted only when a depth-0 comma resolves to a clause separator (see §7). +- **`Passthrough(raw)`** — everything else, byte-for-byte: operators (`AND OR NOT + -`), parens, bare words, + wildcards, phrases/quoted spans, whitespace. + +**Key properties** + +- `field:value` is recognized at any paren depth but **never inside `[]`/`{}` or quotes** — so + `(created:2020 OR foo)` still finds the date token, and commas inside `[2020 TO 2021]` or `"a,b"` are never + clause separators. +- Only date fields (`created`, `modified`, `added`) trigger date translation. Every other `field:value` / + `field:range` (`tag:`, `asn:`, unknown fields) and every `Passthrough` is re-emitted verbatim — preserving + queries Tantivy already handles. +- Multi-valued set is exactly `{tag, tag_id, viewer_id}`. `custom_fields` is now a JSON structure in the index + (Whoosh smashed it into a comma-keyword field; the JSON path handles it better) and is **not** comma-split. + +--- + +## 6. `translate_date_value` — shape dispatch + +One entry point per token type, both emitting `field:[ TO ]`. `created` uses date-only +(UTC-midnight) boundaries; `added`/`modified` use local-tz-midnight→UTC. All boundary math reuses the +existing tested helpers. + +### Scalar value (`FieldValue` on a date field) + +| Shape | Example | Result | Status | +| ----------------------- | ---------------------------------- | ------------------------------------------------------------- | ----------- | +| Keyword (opt. quoted) | `created:today`, `"previous week"` | existing keyword ranges | works today | +| 4-digit `YYYY` | `created:2020` | full-year span, emitted as `[2020-01-01T…Z TO 2021-01-01T…Z]` | NEW | +| 6-digit `YYYYMM` | `created:202003` | month span | NEW | +| 8-digit `YYYYMMDD` | `created:20200101` | day span | works today | +| 14-digit | `…120000` | exact-second point `[t TO t]` | works today | +| ISO dashed | `created:2020-01`, `2020-01-01` | strip separators → digit-precision span | NEW | +| Bare relative `-N unit` | `created:-1week` | `[t TO t]` instant (effectively no-match, matches v2) | NEW (P3) | +| Invalid / unparsable | `created:202023` | **no-match clause, never 400** | NEW | + +### Range (`FieldRange`) + +Parse each bound with the same shape parser, then `floor(lo)` / `ceil(hi)`: + +- Partial / ISO / 8-digit / 14-digit bounds: `[202001 TO 202006]`, `[2020-01-01 TO 2020-12-31]` — NEW. +- `now` bound: `[20200101 TO now]` — NEW. +- Open bound (empty side): `[2020 to]`, `[to 2020]` → sentinel far-past floor / far-future ceil (§8) — NEW. +- Relative bound: generalize existing `[-N unit to now]` so `-N unit` works on either side. +- Reversed (`lo>hi`): swap (existing year-range `min/max` + Whoosh `disambiguated` behavior). +- Bare year range `[2005 to 2009]`: unchanged (works today). + +**Boundary convention:** keep the existing "ceil = start of next period, inclusive bracket" (e.g. +`[2005-01-01 .. 2010-01-01]`) that current tests encode. Do not switch to Whoosh's `23:59:59.999999`; document +the one-instant boundary difference. + +--- + +## 7. Comma resolution + +A depth-0 comma is resolved three ways (this single rule set subsumes both `fix/scope-comma-expansion` and +the unstaged `]`/`"` fix, and fixes Gap E): + +1. **Value-list** — preceding token is a `FieldValue`/`FieldValueList` on `{tag, tag_id, viewer_id}` and the + following continuation is a bare, colon-free term → repeat the field: `tag:a,b,c` → `tag:a AND tag:b AND tag:c`. +2. **Clause separator → `AND`** — fires only at a structured boundary: + - (a) the comma is preceded by a closing `]` or `"` (`created:[r],added:[r]`, `correspondent:"A B",created:[r]`), or + - (b) the comma is followed by a **known schema** `field:` (`title:foo,created:[r]`, `correspondent:foo,created:[r]`). + Requiring a _known_ field for (b) prevents `http://x,…`-style misfires. +3. **Literal** — anything else (a comma followed by a bare term on a non-multivalue field) stays in place: + `correspondent:foo,bar`, `title:10,20`, URLs. Tantivy's analyzer tokenizes these on punctuation, matching + Whoosh's analyzer behavior. + +--- + +## 8. Open-range handling & the two phases + +**Phase 1 (this work) — string output, no tantivy change.** +Open bounds use verified string sentinels: lower-open → `0001-01-01T00:00:00Z`, upper-open → `9999-12-31T23:59:59Z` +(both confirmed to parse on a date field in 0.26.0). No-match (invalid date) uses a degenerate date range +(exact representation flagged for verification in §11). + +**Phase 2 (stretch) — build `tantivy.Query` objects for date clauses.** +`Query.range_query(..., lower_bound=None/upper_bound=None)` gives true open bounds and `empty_query()` gives a +real no-match, eliminating all string hacks. **Gated only on a released `tantivy-py` > 0.26.0 that includes +#655 + #666 — the code already exists on `tantivy-py` `master`, it just postdates the `0.26.0` wheel we pin +(`pyproject.toml`: `tantivy~=0.26.0`); see §9.** Splicing a Query object into an otherwise-string boolean query +is non-trivial, so Phase 2 is a separate, later effort; Phase 1 ships independently. + +Phase 2 also folds in the deferred Phase-1 cleanup (maintainer decision, 2026-06-15): + +- Replace the `NO_MATCH` degenerate-range sentinel with `Query.empty_query()` (this also retires the cosmetic + issue that `NO_MATCH` always names the `created` field regardless of the queried field). +- Replace `OPEN_LO`/`OPEN_HI` string sentinels with `range_query(..., None)` open bounds. +- Retire the now-dead `_rewrite_*` helpers and the `rewrite_natural_date_keywords`/`normalize_query` delegation + shims in `_query.py` (~160 lines left from the Phase-1 transition), and migrate their string-output tests in + `test_query.py` (replace the direct `_rewrite_compact_date` test with a `translate_scalar` test). + +--- + +## 9. Upstream tantivy-py contribution (PR-ready detail) + +> **STATUS UPDATE (2026-06-14): already implemented upstream on `master`.** The date-value gap below is +> closed by two merged `tantivy-py` commits that postdate the released `0.26.0` wheel we pin: +> **#655** (`feat: support unbounded range queries via None bounds`) and **#666** (`fix: add_date loses +tzinfo`, which added the `PyDateTime → tantivy DateTime` converter and routed both `range_query` and +> `term_query` through it). `range_query` with `datetime` (incl. `None` open bounds) and +> `term_query`/`term_set_query` with `datetime` on `Date` fields are verified working upstream with +> regression tests. **The Phase 2 blocker is therefore no longer a code contribution** — it is only a +> published `tantivy-py` release > `0.26.0` containing #655 + #666, plus bumping our pin +> (`pyproject.toml`: `tantivy~=0.26.0`). The PR-ready detail below is retained as the historical record of +> the gap as observed against `0.26.0`. + +**Repo:** `quickwit-oss/tantivy-py`. **Observed version:** `0.26.0` (`tantivy v0.26.0, index_format v7`). + +**Gap.** Python `datetime` objects cannot be passed to _any_ Query constructor for a `Date` field. Both +`Query.range_query` and `Query.term_query` reject them: + +``` +Expected DateTime type for field created, got datetime.datetime(2020, 1, 1, 0, 0, tzinfo=datetime.timezone.utc) +``` + +Int timestamps (seconds and nanoseconds) are also rejected, and there is no exposed/constructible +`tantivy.DateTime` (`hasattr(tantivy, "DateTime") is False`). Consequently **all** date querying in paperless +goes through `parse_query` strings; every object-mode `term_query` in the codebase is on integer fields +(`id`, `owner_id`, `viewer_id`). + +**Context.** PR #655 (merged 2026-04-27) added unbounded (`None`) bounds to `range_query`. That solved open +_bounds_ but left the date _value_ path unusable from Python, so the open-range feature can't actually be used +on date fields from Python yet. + +**Reproduction** (against installed 0.26.0): + +```python +import tantivy +from datetime import datetime, UTC +schema = build_schema() # any schema with a date field "created" +dt1, dt2 = datetime(2020,1,1,tzinfo=UTC), datetime(2021,1,1,tzinfo=UTC) + +tantivy.Query.range_query(schema, "created", tantivy.FieldType.Date, lower_bound=dt1, upper_bound=dt2) +# -> ValueError: Expected DateTime type for field created, got datetime.datetime(...) + +tantivy.Query.range_query(schema, "created", tantivy.FieldType.Date, lower_bound=dt1, upper_bound=None) +# -> same error (open bound is fine; the date VALUE is the problem) + +tantivy.Query.term_query(schema, "created", dt1) +# -> same error +``` + +**Proposed fix (preferred):** in the Rust binding, when the target field is `Date`, accept a Python +`datetime` and convert internally to `tantivy::DateTime` (e.g. `DateTime::from_timestamp_nanos(...)`), mirroring +the conversion the indexing path already performs when adding date values to a document (document add-date +already accepts `PyDateTime`). This makes `range_query`/`term_query` consistent with indexing. The value-coercion +lives in the Query-construction value handling (the term/bound extraction in the query bindings, e.g. +`src/query.rs`); reuse the existing `PyDateTime → tantivy DateTime` converter from the document bindings rather +than adding a new one. Confirm exact locations against the tantivy-py source at PR time. + +**Alternative:** expose a constructible `tantivy.DateTime` (from a Python `datetime` or an epoch-nanos int) and +accept it in `range_query`/`term_query`. Less ergonomic; only do this if reusing the indexing converter proves +awkward. + +**Validation for the PR:** + +- `range_query` on a `Date` field with two `datetime` bounds builds and returns expected hits. +- `range_query` with one `datetime` bound and one `None` (open) works on a `Date` field. +- `term_query` on a `Date` field with a `datetime` builds and matches. +- Round-trip: index a doc with a known date, query it back via both closed and open ranges. + +When this lands and we bump tantivy-py to the release containing it, Phase 2 (§8) becomes unblocked. + +--- + +## 10. Out of scope / known separate gaps + +- **Unknown-field 400.** `http://example.com/a,b` → `Field does not exist: 'http'`. Tantivy treats `http:` as + a field; Whoosh's `remove_unknown=True` degraded unknown fields to text. This is the unknown-field divergence, + not a comma or date issue. Recorded, not fixed here. +- `>`/`<`/`>=`/`<=` comparisons — never supported in paperless-Whoosh. +- Bare relative scalar (`created:-1week`) is P3: it "worked" in v2 but matched nothing. We only guarantee + no-400. + +--- + +## 11. Items to verify during implementation + +- Exact RFC3339 **open-bound sentinels** to standardize on (`0001-01-01T00:00:00Z` / `9999-12-31T23:59:59Z` + both parse; confirm they also behave in actual searches, not just parsing). +- The **no-match clause** string representation for a date field (a degenerate/empty range that parses but + matches nothing). In Phase 2 this becomes `empty_query()`. +- ISO-dashed precision handling parity with Whoosh's separator-stripping (`-`, `.`, space). +- Coordination with `fix/scope-comma-expansion`: either land this after that branch merges and delete its + now-redundant regex, or absorb its narrowing directly. Do not ship both comma implementations. + +--- + +## 12. Test plan (additive) + +- **`test_translate.py` (new):** + - `scan()` token-sequence tests: quotes, brackets, parens, URLs, value-lists, mixed clauses. + - `translate_date_value` shape table: every §6 row (scalar + range), all three date fields, + UTC/Eastern/Auckland timezones (reuse existing tz test patterns). + - comma resolution: value-list (`tag`/`tag_id`/`viewer_id`), clause-sep (after `]`/`"`, before known + `field:`), literal (text fields, URLs, `title:10,20`). + - `translate_query()` golden cases: the full §3 / report-§5b ground-truth matrix. +- **Parse-acceptance guardrail (current tests lack this):** for every golden case assert + `index.parse_query(translate_query(q))` does not raise, against a real index. +- **End-to-end:** a `views.py` search test asserting previously-400 v2 queries (`created:2020`, + `created:[20200101 TO 20201231]`, `title:x,created:[…]`) now return 200. +- Existing tests stay green via delegation; on removal of the old functions, migrate any unique assertions + into `test_translate.py`. + +--- + +## 13. Verification harnesses (keep for regression / ground-truth regeneration) + +**Tantivy side** (does a translated string parse?): + +```bash +cd src && PAPERLESS_SECRET_KEY=x uv run python -c " +import django, os, tempfile +os.environ.setdefault('DJANGO_SETTINGS_MODULE','paperless.settings'); django.setup() +import tantivy +from documents.search._schema import build_schema +from documents.search._tokenizer import register_tokenizers +from documents.search._query import DEFAULT_SEARCH_FIELDS, _FIELD_BOOSTS +idx = tantivy.Index(build_schema(), path=tempfile.mkdtemp()); register_tokenizers(idx,'english') +idx.parse_query('', DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS) +" +``` + +**Whoosh side** (what did v2 do? — ground truth): + +```bash +uv run --with cached_property python3 -W ignore -c " +import sys; sys.path.insert(0,'whoosh/src') +from datetime import datetime +from whoosh.fields import Schema, TEXT, DATETIME, KEYWORD +from whoosh.qparser import MultifieldParser +from whoosh.qparser.dateparse import DateParserPlugin +schema = Schema(title=TEXT(), content=TEXT(), correspondent=TEXT(), + tag=KEYWORD(commas=True, lowercase=True), tag_id=KEYWORD(commas=True), viewer_id=KEYWORD(commas=True), + type=TEXT(), created=DATETIME(), added=DATETIME(), modified=DATETIME(), notes=TEXT(), custom_fields=TEXT()) +qp = MultifieldParser(['content','title','correspondent','tag','type','notes','custom_fields'], schema) +qp.add_plugin(DateParserPlugin(basedate=datetime(2026,6,14,14,0,0))) +print(qp.parse('')) +" +``` + +--- + +## 14. Phased summary + +- **Phase 1 (now):** `_translate.py` scanner + `translate_date_value`, string output, sentinel open bounds, + delegation shims, additive tests, parse-acceptance guardrail, end-to-end 400→200 tests. Ships on tantivy + 0.26.0, no upstream dependency. Subsumes `fix/scope-comma-expansion`. +- **Phase 2 (later, gated on §9 upstream):** build `tantivy.Query` objects for date clauses — true open ranges + via `range_query(None)`, real no-match via `empty_query()`, no string sentinels. Requires the tantivy-py + date-value contribution and a version bump. diff --git a/docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md b/docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md new file mode 100644 index 000000000..558b53e5a --- /dev/null +++ b/docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md @@ -0,0 +1,41 @@ +# Agent prompt: add tracking todo for search Phase 2 (Whoosh→Tantivy date queries) + +Paste the block below to the paperless-ngx agent. It is self-contained. + +--- + +**Add a tracking todo for search Phase 2 (Whoosh→Tantivy date queries).** + +Context discovered while reviewing the upstream `tantivy-py` library against +`docs/superpowers/specs/2026-06-14-search-query-translation-design.md`: + +- §9 of that spec describes an "upstream tantivy-py contribution" needed before Phase 2: + making Python `datetime` objects work in `Query.range_query` / `Query.term_query` on + `Date` fields. +- **That contribution is already implemented on `tantivy-py` `master`** — it just postdates + the released `0.26.0` wheel the spec was tested against. Two commits close the gap: + **#655** (`feat: support unbounded range queries via None bounds`) and **#666** + (`fix: add_date loses tzinfo`, which added the `PyDateTime → tantivy DateTime` converter + and routed both `range_query` and `term_query` through it). `range_query` with `datetime` + (incl. `None` open bounds) and `term_query`/`term_set_query` with `datetime` on `Date` + fields were all verified working, and regression tests were added upstream. +- So the Phase 2 blocker is **no longer a code contribution** — it is simply **a released + `tantivy-py` version newer than the current `0.26.0` wheel that includes #655 + #666**, + plus the dependency bump on our side. + +Please create a tracking todo (in whatever issue/todo system this repo uses) capturing: + +1. **Title:** "Unblock search Phase 2: bump tantivy-py once a release includes datetime query + support (#655 + #666)." +2. **Trigger:** A `tantivy-py` release > the current `0.26.0` wheel containing both commits is + published to PyPI. +3. **Action when unblocked:** Bump the `tantivy-py` pin, then execute Phase 2 from the design + doc — replace Phase 1's string-sentinel open bounds (`0001-01-01…Z` / `9999-12-31…Z`) and + degenerate no-match ranges with real `tantivy.Query` objects (`range_query(..., None)` for + open bounds, `empty_query()` for no-match). +4. **Doc update:** Note in §8/§9 of + `docs/superpowers/specs/2026-06-14-search-query-translation-design.md` that the upstream + code already exists on master and only a release + bump remains. + +Do not start Phase 2 implementation now — this is only a tracking todo. Confirm the current +pinned `tantivy-py` version in our dependency files when writing it.