More done work
committed by
stumpylog
parent
60e4715a00
commit
b47cc92b29
@@ -0,0 +1,407 @@
|
||||
# Design: Whoosh→Tantivy Advanced-Query Translation Layer
|
||||
|
||||
**Date:** 2026-06-14
|
||||
**Status:** Phase 1 implemented on branch `fix/search-query-translation` (string-pipeline translation layer in `_translate.py`/`_dates.py`, wired into `parse_user_query`). Phase 2 (Query objects) remains gated on the tantivy-py release noted in §8/§9. Plan: `docs/superpowers/plans/2026-06-14-search-query-translation.md`.
|
||||
**Branch context:** `beta`. Search code: `src/documents/search/`.
|
||||
**Related:** `SEARCH_TANTIVY_WHOOSH_COMPAT.md` (repo root) — full empirical gap matrix and reproduction harnesses. Open branch `fix/scope-comma-expansion` (commit `d8fa97232`) — partial comma fix this design subsumes.
|
||||
|
||||
---
|
||||
|
||||
## 1. Problem
|
||||
|
||||
Paperless migrated full-text search from Whoosh (v2) to Tantivy (v3, commit `aed9abe48`, #12471). A
|
||||
compatibility layer in `_query.py` rewrites old Whoosh query syntax into Tantivy syntax via a stack of
|
||||
ordered regex substitutions before calling `tantivy.Index.parse_query`.
|
||||
|
||||
That regex stack is piecemeal and has hit its complexity ceiling:
|
||||
|
||||
- **No structural awareness.** It runs regex on a flat string, so it cannot distinguish a comma inside
|
||||
`[...]` from a top-level clause separator, or know whether a `:` is a field prefix or text. This causes
|
||||
real bugs (e.g. `title:x,created:[2020 TO 2021]` rewrites to malformed `title:x AND title:created:[...]`).
|
||||
- **Order-dependence.** Six rewriters with implicit ordering contracts (14-digit before 8-digit, year-range
|
||||
before 8-digit, etc.). Each new date form means reasoning about all interactions again.
|
||||
|
||||
The result is a class of v2-valid queries that now return **HTTP 400**. There is no fallback: any syntax
|
||||
Tantivy rejects raises out of `parse_query`, propagates through `_backend.py` (no try/except), and is caught
|
||||
by the generic handler in `views.py:2471-2475` → `HttpResponseBadRequest`, with the real error only in logs.
|
||||
|
||||
### Confirmed regressions (empirically reproduced; full table in `SEARCH_TANTIVY_WHOOSH_COMPAT.md` §5)
|
||||
|
||||
| Class                    | Example                                                        | Today                  | Whoosh v2                   |
| ------------------------ | -------------------------------------------------------------- | ---------------------- | --------------------------- |
| Bare date on date field  | `created:2020`, `created:202003`                               | 400                    | full-year / full-month span |
| Bracketed absolute range | `created:[20200101 TO 20201231]`, `[2020-01-01 TO 2020-12-31]` | 400                    | floor/ceil range            |
| Open-ended range         | `created:[2020 to]`, `created:[to 2020]`                       | 400                    | `>=` / `<=` range           |
| Comma between clauses    | `title:x,created:[...]`                                        | 400 (malformed)        | AND, both sides             |
| Comma value-list scope   | `tag:foo,type:bar`                                             | wrong (`tag:type:bar`) | `tag:foo AND type:bar`      |
| Invalid date             | `created:202023`                                               | 400                    | NullQuery (no-match)        |
|
||||
|
||||
---
|
||||
|
||||
## 2. Goals / Non-goals
|
||||
|
||||
**Goals**
|
||||
|
||||
- Eliminate the date- and comma-class 400s by translating those forms to valid Tantivy syntax.
|
||||
- Replace the order-dependent regex stack with a structural, context-aware pass.
|
||||
- Match empirically-verified Whoosh v2 semantics (see §3).
|
||||
- Additive tests: existing suite stays green during transition.
|
||||
- **Field-name aliasing for the four renamed Whoosh→Tantivy fields** (added to scope 2026-06-14):
|
||||
`type`→`document_type`, `type_id`→`document_type_id`, `path`→`storage_path`, `path_id`→`storage_path_id`.
|
||||
These are the only fields the Tantivy migration renamed; v2 queries using the old names currently 400.
|
||||
Both old and new spellings work after aliasing (new names pass through verbatim). The alias targets are the
|
||||
text "name" fields (`document_type` is populated from `document_type.name`), so `type:invoice` →
|
||||
`document_type:invoice` is correct. Fields with no Tantivy equivalent (`owner`, the `has_*` booleans,
|
||||
`is_shared`, `custom_field_count`, `custom_fields_id`) are NOT aliased and remain out of scope.
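
The aliasing goal above amounts to a small lookup applied to each recognized field name. A minimal sketch, assuming a hypothetical helper `alias_field` and table name `FIELD_ALIASES` (neither name is taken from the implementation):

```python
# Old Whoosh field name -> Tantivy field name: the four renames listed above.
# FIELD_ALIASES and alias_field are illustrative names, not the real API.
FIELD_ALIASES = {
    "type": "document_type",
    "type_id": "document_type_id",
    "path": "storage_path",
    "path_id": "storage_path_id",
}


def alias_field(field: str) -> str:
    """Map a v2 field name to its Tantivy name; any other name passes through."""
    return FIELD_ALIASES.get(field, field)


print(alias_field("type"))           # old spelling is rewritten
print(alias_field("document_type"))  # new spelling passes through verbatim
print(alias_field("owner"))          # un-aliased field is left untouched
```

Because unknown names pass through unchanged, fields without a Tantivy equivalent are still handed to Tantivy as written and keep their current behavior.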
|
||||
|
||||
**Non-goals (explicitly out of scope)**
|
||||
|
||||
- Full Whoosh query-language parity.
|
||||
- Other Whoosh divergences: unknown-field-degrades-to-text (`http://x/a,b` → 400 on the `http:` unknown
|
||||
field), tolerant unbalanced parens, case-insensitive `AND/OR/NOT`. These pass through to Tantivy unchanged
|
||||
and are recorded as separate, known gaps (§10).
|
||||
- `>`/`<`/`>=`/`<=` comparison operators — never supported in paperless-Whoosh (no `GtLtPlugin`); adding them
|
||||
would be a new feature, not a compat fix.
|
||||
|
||||
---
|
||||
|
||||
## 3. Empirical ground truth (verified, not inferred)
|
||||
|
||||
Both engines were run directly; do not regress these without re-checking.
|
||||
|
||||
**Whoosh v2** (paperless's exact `MultifieldParser([...]) + DateParserPlugin(basedate=...)` setup):
|
||||
|
||||
- `created:2020` → `DateRange(2020-01-01 .. 2020-12-31 23:59:59)`; `created:202003` → March 2020.
|
||||
- `created:202023` (month 23) → `<_NullQuery>` — **invalid dates match nothing, never error.**
|
||||
- `created:[202001 TO 202006]` → floor/ceil partial-date bounds; `[2020 to]` / `[to 2020]` → open bounds.
|
||||
- `created:-1week` → an exact-microsecond `Term` — parsed but matches ~nothing (useless in v2).
|
||||
- Comma = AND between clauses, both preserved: `created:[r],added:[r]`, `correspondent:acme,created:[...]`,
|
||||
`invoice,created:2020`.
|
||||
- Comma value-list **only** for `KEYWORD(commas=True)` fields (`tag`, `tag_id`, `viewer_id`):
|
||||
`tag:a,b` → `tag:a AND tag:b`. Text-field commas (`correspondent:foo,bar`, `title:10,20`) are split by the
|
||||
field **analyzer** at parse time, not the comma plugin.
|
||||
- `title:x,created:[...]` → only the DateRange (Whoosh drops `title:x`) — a v2 free-mode **bug**; the correct
|
||||
target keeps both sides.
|
||||
|
||||
**Tantivy 0.26.0** (`tantivy v0.26.0, index_format v7`):
|
||||
|
||||
- Date fields require RFC3339 (`...Z`) literals; rejects bare `2020`, `20200101`, `2020-01-01`, lowercase
|
||||
open ranges.
|
||||
- Text-field commas parse fine verbatim (`correspondent:foo,bar`, `title:10,20`, `content:a,b,c`).
|
||||
- Boolean/paren/phrase structure parses correctly, so a translated date token can sit anywhere:
|
||||
`created:[...Z TO ...Z] OR foo` and `(created:[...] OR foo)` both parse.
|
||||
- String date sentinels `0001-01-01T00:00:00Z` and `9999-12-31T23:59:59Z` both parse on a date field.
|
||||
|
||||
---
|
||||
|
||||
## 4. Architecture (Approach 1: flat tokenizing scanner + single date translator)
|
||||
|
||||
The scanner specializes only the date/comma tokens and treats everything else (operators, parens, phrases,
|
||||
words, wildcards) as opaque passthrough. Tantivy keeps doing boolean/grouping/phrase parsing. A `field:value`
|
||||
span is locally recognizable regardless of surrounding boolean context, so the scanner needs no understanding
|
||||
of `AND/OR/NOT`.
|
||||
|
||||
### 4.1 Module layout
|
||||
|
||||
New module `src/documents/search/_translate.py` — single source of truth:
|
||||
|
||||
```
translate_query(raw: str, tz) -> str          # top-level: scan → transform → recombine
scan(raw) -> list[Token]                      # depth-aware char-walk tokenizer
_resolve_commas(tokens) -> list[Token]        # comma → AND / value-list / literal
translate_date_value(field, raw, tz) -> str   # shape-dispatch date translator
```
|
||||
|
||||
Date-boundary math (`_date_only_range`, `_datetime_range`, floor/ceil helpers) **moves** from `_query.py`
|
||||
into `_translate.py` (or a small shared `_dates.py`) so there is one home. The existing math is reused
|
||||
verbatim — not rewritten.
|
||||
|
||||
### 4.2 Data flow
|
||||
|
||||
```
parse_user_query(raw, tz)
  → translate_query(raw, tz)            # NEW pipeline
  → index.parse_query(translated, DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS)
```
|
||||
|
||||
### 4.3 Transition (delegate + planned removal)
|
||||
|
||||
- `rewrite_natural_date_keywords` and `normalize_query` become thin delegators to `translate_query` (or its
|
||||
sub-steps) so their existing assertions still pass.
|
||||
- The plan **explicitly schedules deleting both functions and their string-output tests** once
|
||||
`test_translate.py` covers them. Single source of truth, no lingering dead code.
|
||||
|
||||
### 4.4 Safety net
|
||||
|
||||
`parse_user_query` wraps `translate_query` in try/except. On any unexpected scanner error it falls back to the
|
||||
**raw** query string (today's behavior) and logs a warning. The new layer can never regress below current
|
||||
behavior; worst case equals the status quo.
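
The fallback described in this section can be sketched as a thin wrapper. This is an illustration only: the helper name `translate_or_raw`, the logger name, and passing the translator in as a callable are assumptions made to keep the example self-contained.

```python
import logging
from typing import Callable

# Logger name is an assumption for this sketch.
logger = logging.getLogger("paperless.search")


def translate_or_raw(raw: str, translate: Callable[[str], str]) -> str:
    """Return translate(raw), or the untouched raw query if translation fails."""
    try:
        return translate(raw)
    except Exception:
        # Any unexpected scanner error degrades to the pre-translation behavior.
        logger.warning("Query translation failed; using raw query", exc_info=True)
        return raw


# A translator that crashes still yields the original query string.
print(translate_or_raw("created:2020", lambda q: 1 / 0))
```

Catching the broad `Exception` is deliberate here: the point of the safety net is that no translator bug can make a query fail that would otherwise have reached Tantivy unchanged.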
|
||||
|
||||
---
|
||||
|
||||
## 5. Scanner token model
|
||||
|
||||
`scan()` is a single left-to-right char walk tracking **quote state** and **`[]`/`{}` bracket depth**. Token
|
||||
kinds:
|
||||
|
||||
- **`FieldValue(field, value)`** — `field:value`, value a single bare token (no brackets). Recognized when,
|
||||
outside quotes/brackets, it sees `\w+:` followed by a non-bracket value. Value runs until whitespace, a
|
||||
resolved clause-comma, `)`, or end (may itself be quoted: `correspondent:"A B"`).
|
||||
- **`FieldValueList(field, [v1, v2, …])`** — value-list, **only** for `field ∈ {tag, tag_id, viewer_id}`. A
|
||||
`FieldValue` whose value is immediately followed by `,term` runs with **no spaces and no colon** in the
|
||||
continuation terms. The no-colon rule fixes `tag:foo,type:bar` (the `type:bar` is not swallowed).
|
||||
- **`FieldRange(field, open, lo, hi, close)`** — `field:[lo TO hi]` / `{…}`. Split on case-insensitive
|
||||
`TO`; `lo`/`hi` may be empty (open). Consumed to the matching close bracket.
|
||||
- **`Comma`** — emitted only when a depth-0 comma resolves to a clause separator (see §7).
|
||||
- **`Passthrough(raw)`** — everything else, byte-for-byte: operators (`AND OR NOT + -`), parens, bare words,
|
||||
wildcards, phrases/quoted spans, whitespace.
|
||||
|
||||
**Key properties**
|
||||
|
||||
- `field:value` is recognized at any paren depth but **never inside `[]`/`{}` or quotes** — so
|
||||
`(created:2020 OR foo)` still finds the date token, and commas inside `[2020 TO 2021]` or `"a,b"` are never
|
||||
clause separators.
|
||||
- Only date fields (`created`, `modified`, `added`) trigger date translation. Every other `field:value` /
|
||||
`field:range` (`tag:`, `asn:`, unknown fields) and every `Passthrough` is re-emitted verbatim — preserving
|
||||
queries Tantivy already handles.
|
||||
- Multi-valued set is exactly `{tag, tag_id, viewer_id}`. `custom_fields` is now a JSON structure in the index
|
||||
(Whoosh smashed it into a comma-keyword field; the JSON path handles it better) and is **not** comma-split.
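
The quote and bracket tracking is the part of the scanner that the regex stack could not express. A minimal sketch of just that piece, assuming a hypothetical helper `depth0_commas` that reports which commas are candidates for clause separation (it is not the real `scan()`, emits no tokens, and ignores escaped quotes):

```python
def depth0_commas(raw: str) -> list[int]:
    """Return indexes of commas that sit outside quotes and outside []/{}."""
    positions = []
    in_quote = False
    depth = 0  # nesting depth of [] and {} only; parentheses are ignored
    for i, ch in enumerate(raw):
        if ch == '"':
            in_quote = not in_quote
        elif in_quote:
            continue  # everything inside a quoted span is opaque
        elif ch in "[{":
            depth += 1
        elif ch in "]}":
            depth = max(0, depth - 1)  # tolerate a stray closing bracket
        elif ch == "," and depth == 0:
            positions.append(i)
    return positions


query = "title:x,created:[2020 TO 2021]"
print(depth0_commas(query))            # only the comma after title:x
print(depth0_commas("created:[a,b]"))  # comma inside brackets is skipped
```

Parentheses are intentionally not counted, which mirrors the property above that `field:value` is recognized at any paren depth.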
|
||||
|
||||
---
|
||||
|
||||
## 6. `translate_date_value` — shape dispatch
|
||||
|
||||
Two entry points, one per token type (scalar `FieldValue` and `FieldRange`), both emitting
`field:[<ISO-Z> TO <ISO-Z>]`. `created` uses date-only (UTC-midnight) boundaries; `added`/`modified` use
local-tz-midnight converted to UTC. All boundary math reuses the existing tested helpers.
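
The difference between the two boundary styles can be illustrated for a single calendar day. This is a sketch under assumptions: `day_bounds_utc` is a hypothetical name and does not reproduce the existing `_date_only_range` / `_datetime_range` helpers.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def day_bounds_utc(field: str, day: date, tz: tzinfo) -> tuple[str, str]:
    """Return (start, end) RFC 3339 'Z' strings covering one calendar day."""
    # `created` is date-only, so its boundaries sit at UTC midnight.
    # `added` / `modified` use midnight in the user's timezone, converted to UTC.
    zone = timezone.utc if field == "created" else tz
    start = datetime.combine(day, time.min, tzinfo=zone)
    end = datetime.combine(day + timedelta(days=1), time.min, tzinfo=zone)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (
        start.astimezone(timezone.utc).strftime(fmt),
        end.astimezone(timezone.utc).strftime(fmt),
    )


# A fixed UTC+13 offset stands in for a real zone such as Pacific/Auckland in summer.
plus13 = timezone(timedelta(hours=13))
print(day_bounds_utc("created", date(2020, 1, 1), plus13))
print(day_bounds_utc("added", date(2020, 1, 1), plus13))
```

The end bound is the start of the next day, which follows the "ceil = start of next period" convention this section keeps.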
|
||||
|
||||
### Scalar value (`FieldValue` on a date field)
|
||||
|
||||
| Shape                   | Example                            | Result                                                        | Status      |
| ----------------------- | ---------------------------------- | ------------------------------------------------------------- | ----------- |
| Keyword (opt. quoted)   | `created:today`, `"previous week"` | existing keyword ranges                                       | works today |
| 4-digit `YYYY`          | `created:2020`                     | full-year span, emitted as `[2020-01-01T…Z TO 2021-01-01T…Z]` | NEW         |
| 6-digit `YYYYMM`        | `created:202003`                   | month span                                                    | NEW         |
| 8-digit `YYYYMMDD`      | `created:20200101`                 | day span                                                      | works today |
| 14-digit                | `…120000`                          | exact-second point `[t TO t]`                                 | works today |
| ISO dashed              | `created:2020-01`, `2020-01-01`    | strip separators → digit-precision span                       | NEW         |
| Bare relative `-N unit` | `created:-1week`                   | `[t TO t]` instant (effectively no-match, matches v2)         | NEW (P3)    |
| Invalid / unparsable    | `created:202023`                   | **no-match clause, never 400**                                | NEW         |
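
The digit-precision rows of the table can be sketched as one dispatch on the number of digits. This is illustrative only: `scalar_span` is a hypothetical name, it covers just the 4/6/8-digit and ISO-dashed shapes, and it returns `None` where the real translator would emit a no-match clause.

```python
import re
from datetime import date, timedelta
from typing import Optional, Tuple


def scalar_span(value: str) -> Optional[Tuple[date, date]]:
    """Return (first day, first day of the next period), or None if invalid."""
    digits = re.sub(r"[-. ]", "", value)  # ISO dashed forms collapse to digits
    if not re.fullmatch(r"[0-9]{4}(?:[0-9]{2}){0,2}", digits):
        return None  # not a 4-, 6- or 8-digit shape
    year = int(digits[:4])
    month = int(digits[4:6]) if len(digits) >= 6 else 1
    day = int(digits[6:8]) if len(digits) == 8 else 1
    try:
        start = date(year, month, day)  # raises ValueError for e.g. month 23
        if len(digits) == 4:
            end = date(year + 1, 1, 1)
        elif len(digits) == 6:
            end = date(year + month // 12, month % 12 + 1, 1)
        else:
            end = start + timedelta(days=1)
    except (ValueError, OverflowError):
        return None  # invalid dates match nothing instead of raising
    return start, end


print(scalar_span("2020"))    # full-year span
print(scalar_span("202003"))  # month span
print(scalar_span("202023"))  # invalid month
```

Returning a sentinel instead of raising is what keeps `created:202023` from turning into an HTTP 400; the caller decides how the no-match clause is spelled.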
|
||||
|
||||
### Range (`FieldRange`)
|
||||
|
||||
Parse each bound with the same shape parser, then `floor(lo)` / `ceil(hi)`:
|
||||
|
||||
- Partial / ISO / 8-digit / 14-digit bounds: `[202001 TO 202006]`, `[2020-01-01 TO 2020-12-31]` — NEW.
|
||||
- `now` bound: `[20200101 TO now]` — NEW.
|
||||
- Open bound (empty side): `[2020 to]`, `[to 2020]` → sentinel far-past floor / far-future ceil (§8) — NEW.
|
||||
- Relative bound: generalize existing `[-N unit to now]` so `-N unit` works on either side.
|
||||
- Reversed (`lo>hi`): swap (existing year-range `min/max` + Whoosh `disambiguated` behavior).
|
||||
- Bare year range `[2005 to 2009]`: unchanged (works today).
|
||||
|
||||
**Boundary convention:** keep the existing "ceil = start of next period, inclusive bracket" (e.g.
|
||||
`[2005-01-01 .. 2010-01-01]`) that current tests encode. Do not switch to Whoosh's `23:59:59.999999`; document
|
||||
the one-instant boundary difference.
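
The floor/ceil, open-bound, and swap rules can be sketched for the simplest case, year-only bounds on `created` (which uses UTC-midnight boundaries, so no timezone conversion is needed). The function name `year_range` is hypothetical, and the real translator accepts every bound shape listed above, not only years.

```python
import re

OPEN_LO = "0001-01-01T00:00:00Z"  # lower-open sentinel (see §8)
OPEN_HI = "9999-12-31T23:59:59Z"  # upper-open sentinel (see §8)


def year_range(body: str) -> str:
    """Translate the inside of `created:[lo TO hi]` with year or empty bounds."""
    m = re.fullmatch(r"\s*([0-9]{4})?\s*\bto\b\s*([0-9]{4})?\s*", body, re.IGNORECASE)
    if m is None:
        raise ValueError(f"unsupported range body: {body!r}")
    lo, hi = m.group(1), m.group(2)
    if lo and hi and int(lo) > int(hi):
        lo, hi = hi, lo  # reversed range: swap the bounds
    # floor(lo): start of the year; an empty side falls back to the sentinel.
    lower = f"{lo}-01-01T00:00:00Z" if lo and int(lo) >= 1 else OPEN_LO
    # ceil(hi): start of the NEXT year, kept inclusive as the current tests encode.
    upper = f"{int(hi) + 1:04d}-01-01T00:00:00Z" if hi and int(hi) < 9999 else OPEN_HI
    return f"created:[{lower} TO {upper}]"


print(year_range("2005 to 2009"))
print(year_range("2020 to"))
print(year_range("to 2020"))
```

The first call reproduces the `[2005-01-01 .. 2010-01-01]` convention quoted above; the other two show each sentinel taking the place of an empty side.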
|
||||
|
||||
---
|
||||
|
||||
## 7. Comma resolution
|
||||
|
||||
A depth-0 comma is resolved in exactly one of three ways, checked in this order (this single rule set subsumes
both `fix/scope-comma-expansion` and the unstaged `]`/`"` fix, and fixes Gap E):
|
||||
|
||||
1. **Value-list** — preceding token is a `FieldValue`/`FieldValueList` on `{tag, tag_id, viewer_id}` and the
|
||||
following continuation is a bare, colon-free term → repeat the field: `tag:a,b,c` → `tag:a AND tag:b AND tag:c`.
|
||||
2. **Clause separator → `AND`** — fires only at a structured boundary:
|
||||
- (a) the comma is preceded by a closing `]` or `"` (`created:[r],added:[r]`, `correspondent:"A B",created:[r]`), or
|
||||
- (b) the comma is followed by a **known schema** `field:` (`title:foo,created:[r]`, `correspondent:foo,created:[r]`).
|
||||
Requiring a _known_ field for (b) prevents `http://x,…`-style misfires.
|
||||
3. **Literal** — anything else (a comma followed by a bare term on a non-multivalue field) stays in place:
|
||||
`correspondent:foo,bar`, `title:10,20`, URLs. Tantivy's analyzer tokenizes these on punctuation, matching
|
||||
Whoosh's analyzer behavior.
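
The three rules can be sketched as a classifier over the text on either side of a depth-0 comma. Everything here is illustrative: `classify_comma` is a hypothetical name, and `KNOWN_FIELDS` is a small stand-in for the real schema (it assumes aliased names such as `type` count as known).

```python
import re

MULTI_VALUE_FIELDS = {"tag", "tag_id", "viewer_id"}
# Assumption: a stand-in subset of the schema's field names for this sketch.
KNOWN_FIELDS = MULTI_VALUE_FIELDS | {
    "title", "content", "correspondent", "type", "created", "added", "modified",
}


def classify_comma(left: str, right: str) -> str:
    """Classify the comma between `left` and `right`.

    `left` is the clause text before the comma, `right` the text after it.
    Returns 'value-list', 'and', or 'literal'.
    """
    prev = re.match(r"(\w+):", left)
    cont = re.match(r"[^\s,]*", right).group()  # continuation up to next comma/space
    bare = bool(cont) and ":" not in cont and cont[0] not in '"[{('
    # Rule 1: multi-value field followed by a bare, colon-free term.
    if prev and prev.group(1) in MULTI_VALUE_FIELDS and bare:
        return "value-list"
    # Rule 2: (a) after a closing ] or ", or (b) before a known schema field.
    nxt = re.match(r"(\w+):", right)
    if left.endswith(("]", '"')) or (nxt and nxt.group(1) in KNOWN_FIELDS):
        return "and"
    # Rule 3: everything else stays a literal comma.
    return "literal"


print(classify_comma("tag:a", "b,c"))               # value-list
print(classify_comma("tag:foo", "type:bar"))        # clause separator
print(classify_comma("correspondent:foo", "bar"))   # literal
```

Checking the value-list rule first and requiring a colon-free continuation is what keeps `tag:foo,type:bar` from swallowing `type:bar` into the tag list.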
|
||||
|
||||
---
|
||||
|
||||
## 8. Open-range handling & the two phases
|
||||
|
||||
**Phase 1 (this work) — string output, no tantivy change.**
|
||||
Open bounds use verified string sentinels: lower-open → `0001-01-01T00:00:00Z`, upper-open → `9999-12-31T23:59:59Z`
|
||||
(both confirmed to parse on a date field in 0.26.0). No-match (invalid date) uses a degenerate date range
|
||||
(exact representation flagged for verification in §11).
|
||||
|
||||
**Phase 2 (stretch) — build `tantivy.Query` objects for date clauses.**
|
||||
`Query.range_query(..., lower_bound=None/upper_bound=None)` gives true open bounds and `empty_query()` gives a
|
||||
real no-match, eliminating all string hacks. **Gated only on a released `tantivy-py` > 0.26.0 that includes
|
||||
#655 + #666 — the code already exists on `tantivy-py` `master`, it just postdates the `0.26.0` wheel we pin
|
||||
(`pyproject.toml`: `tantivy~=0.26.0`); see §9.** Splicing a Query object into an otherwise-string boolean query
|
||||
is non-trivial, so Phase 2 is a separate, later effort; Phase 1 ships independently.
|
||||
|
||||
Phase 2 also folds in the deferred Phase-1 cleanup (maintainer decision, 2026-06-15):
|
||||
|
||||
- Replace the `NO_MATCH` degenerate-range sentinel with `Query.empty_query()` (this also retires the cosmetic
|
||||
issue that `NO_MATCH` always names the `created` field regardless of the queried field).
|
||||
- Replace `OPEN_LO`/`OPEN_HI` string sentinels with `range_query(..., None)` open bounds.
|
||||
- Retire the now-dead `_rewrite_*` helpers and the `rewrite_natural_date_keywords`/`normalize_query` delegation
|
||||
shims in `_query.py` (~160 lines left from the Phase-1 transition), and migrate their string-output tests in
|
||||
`test_query.py` (replace the direct `_rewrite_compact_date` test with a `translate_scalar` test).
|
||||
|
||||
---
|
||||
|
||||
## 9. Upstream tantivy-py contribution (PR-ready detail)
|
||||
|
||||
> **STATUS UPDATE (2026-06-14): already implemented upstream on `master`.** The date-value gap below is
|
||||
> closed by two merged `tantivy-py` commits that postdate the released `0.26.0` wheel we pin:
|
||||
> **#655** (`feat: support unbounded range queries via None bounds`) and **#666** (`fix: add_date loses
|
||||
> tzinfo`, which added the `PyDateTime → tantivy DateTime` converter and routed both `range_query` and
|
||||
> `term_query` through it). `range_query` with `datetime` (incl. `None` open bounds) and
|
||||
> `term_query`/`term_set_query` with `datetime` on `Date` fields are verified working upstream with
|
||||
> regression tests. **The Phase 2 blocker is therefore no longer a code contribution** — it is only a
|
||||
> published `tantivy-py` release > `0.26.0` containing #655 + #666, plus bumping our pin
|
||||
> (`pyproject.toml`: `tantivy~=0.26.0`). The PR-ready detail below is retained as the historical record of
|
||||
> the gap as observed against `0.26.0`.
|
||||
|
||||
**Repo:** `quickwit-oss/tantivy-py`. **Observed version:** `0.26.0` (`tantivy v0.26.0, index_format v7`).
|
||||
|
||||
**Gap.** Python `datetime` objects cannot be passed to _any_ Query constructor for a `Date` field. Both
|
||||
`Query.range_query` and `Query.term_query` reject them:
|
||||
|
||||
```
Expected DateTime type for field created, got datetime.datetime(2020, 1, 1, 0, 0, tzinfo=datetime.timezone.utc)
```
|
||||
|
||||
Int timestamps (seconds and nanoseconds) are also rejected, and there is no exposed/constructible
|
||||
`tantivy.DateTime` (`hasattr(tantivy, "DateTime") is False`). Consequently **all** date querying in paperless
|
||||
goes through `parse_query` strings; every object-mode `term_query` in the codebase is on integer fields
|
||||
(`id`, `owner_id`, `viewer_id`).
|
||||
|
||||
**Context.** PR #655 (merged 2026-04-27) added unbounded (`None`) bounds to `range_query`. That solved open
|
||||
_bounds_ but left the date _value_ path unusable from Python, so the open-range feature can't actually be used
|
||||
on date fields from Python yet.
|
||||
|
||||
**Reproduction** (against installed 0.26.0):
|
||||
|
||||
```python
from datetime import UTC, datetime

import tantivy

# Any schema with a date field "created" works. The paperless one is used here;
# the §13 harness runs django.setup() before importing it.
from documents.search._schema import build_schema

schema = build_schema()
dt1 = datetime(2020, 1, 1, tzinfo=UTC)
dt2 = datetime(2021, 1, 1, tzinfo=UTC)

# Closed range with two datetime bounds.
tantivy.Query.range_query(schema, "created", tantivy.FieldType.Date, lower_bound=dt1, upper_bound=dt2)
# -> ValueError: Expected DateTime type for field created, got datetime.datetime(...)

# Open upper bound.
tantivy.Query.range_query(schema, "created", tantivy.FieldType.Date, lower_bound=dt1, upper_bound=None)
# -> same error (open bound is fine; the date VALUE is the problem)

# Exact-term lookup.
tantivy.Query.term_query(schema, "created", dt1)
# -> same error
```
|
||||
|
||||
**Proposed fix (preferred):** in the Rust binding, when the target field is `Date`, accept a Python
|
||||
`datetime` and convert internally to `tantivy::DateTime` (e.g. `DateTime::from_timestamp_nanos(...)`), mirroring
|
||||
the conversion the indexing path already performs when adding date values to a document (document add-date
|
||||
already accepts `PyDateTime`). This makes `range_query`/`term_query` consistent with indexing. The value-coercion
|
||||
lives in the Query-construction value handling (the term/bound extraction in the query bindings, e.g.
|
||||
`src/query.rs`); reuse the existing `PyDateTime → tantivy DateTime` converter from the document bindings rather
|
||||
than adding a new one. Confirm exact locations against the tantivy-py source at PR time.
|
||||
|
||||
**Alternative:** expose a constructible `tantivy.DateTime` (from a Python `datetime` or an epoch-nanos int) and
|
||||
accept it in `range_query`/`term_query`. Less ergonomic; only do this if reusing the indexing converter proves
|
||||
awkward.
|
||||
|
||||
**Validation for the PR:**
|
||||
|
||||
- `range_query` on a `Date` field with two `datetime` bounds builds and returns expected hits.
|
||||
- `range_query` with one `datetime` bound and one `None` (open) works on a `Date` field.
|
||||
- `term_query` on a `Date` field with a `datetime` builds and matches.
|
||||
- Round-trip: index a doc with a known date, query it back via both closed and open ranges.
|
||||
|
||||
When this lands and we bump tantivy-py to the release containing it, Phase 2 (§8) becomes unblocked.
|
||||
|
||||
---
|
||||
|
||||
## 10. Out of scope / known separate gaps
|
||||
|
||||
- **Unknown-field 400.** `http://example.com/a,b` → `Field does not exist: 'http'`. Tantivy treats `http:` as
|
||||
a field; Whoosh's `remove_unknown=True` degraded unknown fields to text. This is the unknown-field divergence,
|
||||
not a comma or date issue. Recorded, not fixed here.
|
||||
- `>`/`<`/`>=`/`<=` comparisons — never supported in paperless-Whoosh.
|
||||
- Bare relative scalar (`created:-1week`) is P3: it "worked" in v2 but matched nothing. We only guarantee
|
||||
no-400.
|
||||
|
||||
---
|
||||
|
||||
## 11. Items to verify during implementation
|
||||
|
||||
- Exact RFC3339 **open-bound sentinels** to standardize on (`0001-01-01T00:00:00Z` / `9999-12-31T23:59:59Z`
|
||||
both parse; confirm they also behave in actual searches, not just parsing).
|
||||
- The **no-match clause** string representation for a date field (a degenerate/empty range that parses but
|
||||
matches nothing). In Phase 2 this becomes `empty_query()`.
|
||||
- ISO-dashed precision handling parity with Whoosh's separator-stripping (`-`, `.`, space).
|
||||
- Coordination with `fix/scope-comma-expansion`: either land this after that branch merges and delete its
|
||||
now-redundant regex, or absorb its narrowing directly. Do not ship both comma implementations.
|
||||
|
||||
---
|
||||
|
||||
## 12. Test plan (additive)
|
||||
|
||||
- **`test_translate.py` (new):**
|
||||
- `scan()` token-sequence tests: quotes, brackets, parens, URLs, value-lists, mixed clauses.
|
||||
- `translate_date_value` shape table: every §6 row (scalar + range), all three date fields,
|
||||
UTC/Eastern/Auckland timezones (reuse existing tz test patterns).
|
||||
- comma resolution: value-list (`tag`/`tag_id`/`viewer_id`), clause-sep (after `]`/`"`, before known
|
||||
`field:`), literal (text fields, URLs, `title:10,20`).
|
||||
- `translate_query()` golden cases: the full §3 / report-§5b ground-truth matrix.
|
||||
- **Parse-acceptance guardrail (current tests lack this):** for every golden case assert
|
||||
`index.parse_query(translate_query(q))` does not raise, against a real index.
|
||||
- **End-to-end:** a `views.py` search test asserting previously-400 v2 queries (`created:2020`,
|
||||
`created:[20200101 TO 20201231]`, `title:x,created:[…]`) now return 200.
|
||||
- Existing tests stay green via delegation; on removal of the old functions, migrate any unique assertions
|
||||
into `test_translate.py`.
|
||||
|
||||
---
|
||||
|
||||
## 13. Verification harnesses (keep for regression / ground-truth regeneration)
|
||||
|
||||
**Tantivy side** (does a translated string parse?):
|
||||
|
||||
```bash
cd src && PAPERLESS_SECRET_KEY=x uv run python -c "
import django, os, tempfile
os.environ.setdefault('DJANGO_SETTINGS_MODULE','paperless.settings'); django.setup()
import tantivy
from documents.search._schema import build_schema
from documents.search._tokenizer import register_tokenizers
from documents.search._query import DEFAULT_SEARCH_FIELDS, _FIELD_BOOSTS
idx = tantivy.Index(build_schema(), path=tempfile.mkdtemp()); register_tokenizers(idx,'english')
idx.parse_query('<translated string>', DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS)
"
```
|
||||
|
||||
**Whoosh side** (what did v2 do? — ground truth):
|
||||
|
||||
```bash
uv run --with cached_property python3 -W ignore -c "
import sys; sys.path.insert(0,'whoosh/src')
from datetime import datetime
from whoosh.fields import Schema, TEXT, DATETIME, KEYWORD
from whoosh.qparser import MultifieldParser
from whoosh.qparser.dateparse import DateParserPlugin
schema = Schema(title=TEXT(), content=TEXT(), correspondent=TEXT(),
    tag=KEYWORD(commas=True, lowercase=True), tag_id=KEYWORD(commas=True), viewer_id=KEYWORD(commas=True),
    type=TEXT(), created=DATETIME(), added=DATETIME(), modified=DATETIME(), notes=TEXT(), custom_fields=TEXT())
qp = MultifieldParser(['content','title','correspondent','tag','type','notes','custom_fields'], schema)
qp.add_plugin(DateParserPlugin(basedate=datetime(2026,6,14,14,0,0)))
print(qp.parse('<query>'))
"
```
|
||||
|
||||
---
|
||||
|
||||
## 14. Phased summary
|
||||
|
||||
- **Phase 1 (now):** `_translate.py` scanner + `translate_date_value`, string output, sentinel open bounds,
|
||||
delegation shims, additive tests, parse-acceptance guardrail, end-to-end 400→200 tests. Ships on tantivy
|
||||
0.26.0, no upstream dependency. Subsumes `fix/scope-comma-expansion`.
|
||||
- **Phase 2 (later, gated on the release described in §9):** build `tantivy.Query` objects for date clauses:
  true open ranges via `range_query(..., None)`, real no-match via `empty_query()`, no string sentinels. The
  upstream code (#655 + #666) already exists on `tantivy-py` `master`; what remains is a published release
  newer than `0.26.0` and the matching bump of our pin.
|
||||
@@ -0,0 +1,41 @@
|
||||
# Agent prompt: add tracking todo for search Phase 2 (Whoosh→Tantivy date queries)
|
||||
|
||||
Paste the block below to the paperless-ngx agent. It is self-contained.
|
||||
|
||||
---
|
||||
|
||||
**Add a tracking todo for search Phase 2 (Whoosh→Tantivy date queries).**
|
||||
|
||||
Context discovered while reviewing the upstream `tantivy-py` library against
|
||||
`docs/superpowers/specs/2026-06-14-search-query-translation-design.md`:
|
||||
|
||||
- §9 of that spec describes an "upstream tantivy-py contribution" needed before Phase 2:
|
||||
making Python `datetime` objects work in `Query.range_query` / `Query.term_query` on
|
||||
`Date` fields.
|
||||
- **That contribution is already implemented on `tantivy-py` `master`** — it just postdates
|
||||
the released `0.26.0` wheel the spec was tested against. Two commits close the gap:
|
||||
**#655** (`feat: support unbounded range queries via None bounds`) and **#666**
|
||||
(`fix: add_date loses tzinfo`, which added the `PyDateTime → tantivy DateTime` converter
|
||||
and routed both `range_query` and `term_query` through it). `range_query` with `datetime`
|
||||
(incl. `None` open bounds) and `term_query`/`term_set_query` with `datetime` on `Date`
|
||||
fields were all verified working, and regression tests were added upstream.
|
||||
- So the Phase 2 blocker is **no longer a code contribution** — it is simply **a released
|
||||
`tantivy-py` version newer than the current `0.26.0` wheel that includes #655 + #666**,
|
||||
plus the dependency bump on our side.
|
||||
|
||||
Please create a tracking todo (in whatever issue/todo system this repo uses) capturing:
|
||||
|
||||
1. **Title:** "Unblock search Phase 2: bump tantivy-py once a release includes datetime query
|
||||
support (#655 + #666)."
|
||||
2. **Trigger:** A `tantivy-py` release > the current `0.26.0` wheel containing both commits is
|
||||
published to PyPI.
|
||||
3. **Action when unblocked:** Bump the `tantivy-py` pin, then execute Phase 2 from the design
|
||||
doc — replace Phase 1's string-sentinel open bounds (`0001-01-01…Z` / `9999-12-31…Z`) and
|
||||
degenerate no-match ranges with real `tantivy.Query` objects (`range_query(..., None)` for
|
||||
open bounds, `empty_query()` for no-match).
|
||||
4. **Doc update:** Note in §8/§9 of
|
||||
`docs/superpowers/specs/2026-06-14-search-query-translation-design.md` that the upstream
|
||||
code already exists on master and only a release + bump remains.
|
||||
|
||||
Do not start Phase 2 implementation now — this is only a tracking todo. Confirm the current
|
||||
pinned `tantivy-py` version in our dependency files when writing it.
|
||||