From dcbac2b090f012ed9892b0cc2222e9f38284bf3f Mon Sep 17 00:00:00 2001
From: Trenton Holmes <797416+stumpylog@users.noreply.github.com>
Date: Mon, 15 Jun 2026 11:00:05 -0700
Subject: [PATCH] Tracks the tantivy update blocker

---
 ...026-06-14-search-phase2-tracking-prompt.md |   0
 ...6-06-15-search-query-translation-design.md | 129 ++++++++++++++++++
 2 files changed, 129 insertions(+)
 rename docs/superpowers/{ => done}/specs/2026-06-14-search-phase2-tracking-prompt.md (100%)
 create mode 100644 docs/superpowers/specs/2026-06-15-search-query-translation-design.md

diff --git a/docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md b/docs/superpowers/done/specs/2026-06-14-search-phase2-tracking-prompt.md
similarity index 100%
rename from docs/superpowers/specs/2026-06-14-search-phase2-tracking-prompt.md
rename to docs/superpowers/done/specs/2026-06-14-search-phase2-tracking-prompt.md
diff --git a/docs/superpowers/specs/2026-06-15-search-query-translation-design.md b/docs/superpowers/specs/2026-06-15-search-query-translation-design.md
new file mode 100644
index 000000000..00e34686f
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-15-search-query-translation-design.md
@@ -0,0 +1,129 @@
+# Whoosh→Tantivy Advanced-Query Translation — Design
+
+**Date:** 2026-06-15
+**Branch base:** `dev` (Phase 1 implemented on `fix/search-query-translation`)
+**Status:** Phase 1 implemented; Phase 2 pending an upstream `tantivy-py` release (see below)
+
+## Problem
+
+The Tantivy search migration changed the advanced-query (`?query=`) syntax contract.
+A class of queries that worked under the old Whoosh backend now returns an opaque
+**HTTP 400**, because the old query string is handed to `tantivy.Index.parse_query`,
+which rejects forms Whoosh accepted. There is no fallback: a parse error propagates
+through `documents/search/_backend.py` and is caught by the generic handler in
+`documents/views.py` → `HttpResponseBadRequest`, with the real error only in logs.
+
+Affected query forms (all verified against real Whoosh and real Tantivy):
+
+- Bare dates on a date field: `created:2020`, `created:202003`.
+- Bracketed absolute/partial/ISO ranges: `created:[20200101 TO 20201231]`,
+  `created:[2020-01-01 TO 2020-12-31]`, `created:[202001 TO 202006]`.
+- Open-ended ranges: `created:[2020 to]`, `created:[to 2020]`.
+- Relative ranges: `added:[-1 week to now]`, `added:[now-7d TO now]`.
+- Comma-joined clauses and value lists: `created:[r],added:[r]`, `tag:foo,bar`,
+  the malformed `title:x,created:[…]`.
+- Renamed fields: `type:`/`path:` (Whoosh names) vs `document_type:`/`storage_path:`
+  (Tantivy names).
+- Invalid dates (`created:202023`) — Whoosh matched nothing (NullQuery); Tantivy 400s.
+
+The original compatibility layer was a stack of order-dependent regex substitutions,
+which had no structural awareness (could not tell a comma inside `[...]` from a clause
+separator) and required a new regex per form. This design replaces it.
+
+## Approach
+
+A structural, context-aware translation pass that intercepts only the forms Tantivy
+parses differently (dates, commas, renamed fields) and passes everything else
+(booleans, grouping, phrases, wildcards) straight through to Tantivy's own parser.
+
+Pipeline (`documents/search/_translate.py`):
+
+```
+parse_user_query(raw, tz)
+  → translate_query(raw, tz)            # wrapped in a safety net: on any
+      → scan(raw)                       depth-aware tokenizer (quotes / [] depth)
+      → resolve_commas                  value-list (tag/tag_id/viewer_id) vs clause separator
+      → _render                         date tokens → translate_scalar / translate_range,
+                                        field aliasing, comma → AND
+      → operator normalization          (spaced/trailing -/+ cleanup)
+  → index.parse_query(translated, …)    # exception → fall back to raw query
+```
+
+Date math lives in `documents/search/_dates.py` (no Django deps). The two date-field
+semantics are preserved: `created` is date-only (UTC-midnight boundaries);
+`added`/`modified` are datetimes (local-tz-midnight → UTC).
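
To make the date handling concrete, here is a minimal standalone sketch of flooring and ceiling a compact partial date and of converting a local-midnight boundary to UTC. It is not the code in `_dates.py`: the names `partial_date_bounds` and `local_midnight_utc` are hypothetical, only the compact `YYYY` / `YYYYMM` / `YYYYMMDD` forms are covered, and the fixed-offset timezone in the usage note stands in for whatever `tz` the real pipeline passes.

```python
from __future__ import annotations

import calendar
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def partial_date_bounds(text: str) -> tuple[date, date] | None:
    """Floor/ceil a compact partial date (YYYY, YYYYMM or YYYYMMDD).

    Hypothetical helper: returns None for an unparsable value so that a
    caller can emit a no-match clause instead of raising.
    """
    if not (text.isascii() and text.isdigit()) or len(text) not in (4, 6, 8):
        return None
    year = int(text[:4])
    month = int(text[4:6]) if len(text) >= 6 else None
    day = int(text[6:8]) if len(text) == 8 else None
    try:
        if month is None:
            # Year only: first day of January to last day of December.
            return date(year, 1, 1), date(year, 12, 31)
        if day is None:
            # Year and month: first to last day of that month.
            last_day = calendar.monthrange(year, month)[1]
            return date(year, month, 1), date(year, month, last_day)
        exact = date(year, month, day)
        return exact, exact
    except ValueError:
        # Out-of-range parts, e.g. month 23 in "202023".
        return None


def local_midnight_utc(day: date, tz: tzinfo) -> datetime:
    """UTC instant of local midnight, as a datetime field boundary would need."""
    return datetime.combine(day, time.min, tzinfo=tz).astimezone(timezone.utc)
```

For example, `partial_date_bounds("202003")` yields the first and last day of March 2020, `partial_date_bounds("202023")` yields `None`, and `local_midnight_utc(date(2020, 3, 1), timezone(timedelta(hours=2)))` is 22:00 UTC on 29 February 2020.
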
+
+### Verified compatibility contract (from running both engines)
+
+- Comma between clauses = AND, both sides preserved; comma within a `KEYWORD(commas=True)`
+  field value = value list. Multi-value fields are exactly `{tag, tag_id, viewer_id}`.
+- Invalid/unparsable dates → a no-match clause (never a 400), matching Whoosh's NullQuery.
+- Field renames to alias: `type`→`document_type`, `type_id`→`document_type_id`,
+  `path`→`storage_path`, `path_id`→`storage_path_id`. Both old and new names work.
+- Partial-date ranges floor the low bound and ceil the high bound; reversed ranges swap.
+
+## Phase 1 (implemented)
+
+Branch `fix/search-query-translation`. The full pipeline above, output as a Tantivy
+query **string**, with these workarounds for things the string parser cannot express
+on date fields:
+
+- **Open-ended ranges** use far-past / far-future string sentinels
+  (`0001-01-01T00:00:00Z` / `9999-12-31T23:59:59Z`).
+- **No-match** (unparsable date) uses a degenerate equal-bound date range.
+
+Status: complete; the full `-m search` suite passes (date forms, comma clauses, field
+aliasing, relative ranges, operator normalization, and the existing search tests now
+validating the new pipeline). The old `_rewrite_*` regex helpers were left in place as
+delegation shims during the transition.
+
+## Phase 2 (pending — the thing being tracked)
+
+Replace the Phase-1 string workarounds with real `tantivy.Query` objects for date
+clauses, which removes the sentinel/degenerate-range hacks entirely:
+
+1. **Open bounds** via `Query.range_query(field, FieldType.Date, lower_bound=…,
+upper_bound=None)` (and vice-versa) instead of `OPEN_LO`/`OPEN_HI` sentinels.
+2. **No-match** via `Query.empty_query()` instead of the degenerate range. This also
+   fixes the cosmetic issue that the no-match sentinel always names the `created` field.
+3. **Retire the dead code**: remove the now-unused `_rewrite_*` helpers and the
+   `rewrite_natural_date_keywords` / `normalize_query` delegation shims in `_query.py`
+   (~160 lines left from the Phase-1 transition), and migrate their string-output tests
+   in `test_query.py` (replace the direct `_rewrite_compact_date` test with a
+   `translate_scalar` test).
+
+### Blocker
+
+Phase 2 is **gated on a published `tantivy-py` release**, not on any further code
+contribution. In `tantivy-py 0.26.0` (our current pin: `tantivy~=0.26.0` in
+`pyproject.toml`, released 2026-04-29), `range_query`/`term_query` **reject Python
+`datetime` values on `Date` fields** (`Expected DateTime type for field …`), so date
+Query objects cannot be built from Python. The fix is already merged on `tantivy-py`
+`master` across two PRs:
+
+- **#655** — `feat: support unbounded range queries via None bounds`.
+- **#666** — `fix: add_date loses tzinfo` (adds the `PyDateTime → tantivy DateTime`
+  converter and routes `range_query`/`term_query` through it).
+
+Both postdate the `0.26.0` wheel.
+
+- **Trigger:** a `tantivy-py` release `> 0.26.0` containing #655 + #666 is published to PyPI.
+- **Action:** bump the `tantivy-py` pin, then do items 1–3 above.
+
+## Out of scope
+
+- Unknown-field handling: Whoosh degraded an unknown `field:` to text; Tantivy 400s
+  (`http://x/a,b` → `Field does not exist: 'http'`). Not a date/comma/rename issue.
+- Whoosh fields with no Tantivy equivalent: `owner` (text), the `has_*` presence
+  booleans, `is_shared`, `custom_field_count`, `custom_fields_id`.
+- `>`/`<`/`>=`/`<=` comparisons — never supported in paperless-Whoosh (no `GtLtPlugin`).
+
+## Reference / how to re-verify
+
+- Tantivy side (does a translated string parse?): build a real index via
+  `documents.search._schema.build_schema` + `register_tokenizers`, then
+  `index.parse_query(translate_query(q, tz), DEFAULT_SEARCH_FIELDS, field_boosts=…)`.
+- Whoosh side (what did v2 do?): the old `get_schema()` + `MultifieldParser([...]) +
+DateParserPlugin(...)` still exists on `main` (`src/documents/index.py`); run a query
+  through it to get the ground-truth `Query`.
+- A fuller empirical gap matrix lives in `SEARCH_TANTIVY_WHOOSH_COMPAT.md`.
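
When re-verifying by hand, the comma rule from the compatibility contract (a comma between clauses means AND, while a comma inside a multi-value field stays a value list) can be approximated without either engine. The sketch below is an illustration, not the `scan` / `resolve_commas` code in `_translate.py`: the function names are hypothetical, escaped quotes are ignored, and it deliberately leaves the value list untouched rather than guessing how the real renderer expresses it for Tantivy.

```python
from __future__ import annotations

# Multi-value fields as listed in the compatibility contract.
MULTI_VALUE_FIELDS = {"tag", "tag_id", "viewer_id"}


def split_top_level_commas(raw: str) -> list[str]:
    """Split on commas that sit outside double quotes and outside [...]."""
    parts: list[str] = []
    buf: list[str] = []
    depth = 0
    in_quote = False
    for ch in raw:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote and ch == "[":
            depth += 1
        elif not in_quote and ch == "]" and depth:
            depth -= 1
        if ch == "," and depth == 0 and not in_quote:
            parts.append("".join(buf))
            buf = []
        else:
            buf.append(ch)
    parts.append("".join(buf))
    return parts


def join_clauses(raw: str) -> str:
    """Turn clause-separating commas into AND and keep value-list commas.

    Simplifying assumption: a segment with no `field:` prefix that follows a
    clause on a multi-value field is another value of that field.
    """
    out: list[str] = []
    current_field: str | None = None
    for seg in split_top_level_commas(raw):
        field, sep, _ = seg.partition(":")
        is_clause = bool(sep) and field.strip().isidentifier()
        if not is_clause and current_field in MULTI_VALUE_FIELDS and out:
            out[-1] += "," + seg  # still part of the value list
            continue
        current_field = field.strip() if is_clause else None
        out.append(seg.strip())
    return " AND ".join(out)
```

With these definitions, `join_clauses('title:"a,b",tag:foo,bar')` gives `title:"a,b" AND tag:foo,bar`: the quoted comma and the tag value list survive, and only the clause separator becomes `AND`. Tracking quote state and bracket depth in a single pass is what the regex-based layer could not do.
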