diff --git a/docs/superpowers/plans/2026-06-15-search-error-shapes-followup.md b/docs/superpowers/plans/2026-06-15-search-error-shapes-followup.md new file mode 100644 index 000000000..ff2bca9d8 --- /dev/null +++ b/docs/superpowers/plans/2026-06-15-search-error-shapes-followup.md @@ -0,0 +1,790 @@ +# Search Error Shapes Follow-up Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Replace the generic advanced-search HTTP 400 ("Error listing search results, check logs for more detail.") with three specific, user-fixable `SearchQueryError` subclasses (`UnknownFieldError`, `InvalidFieldValueError`, `MalformedQueryError`). + +**Architecture:** Two detection layers feeding the _existing_ `except SearchQueryError` handler in `UnifiedSearchViewSet.list` (no view change). (1) A **proactive** numeric-value validator inside `translate_query`'s render pass (`_translate.py`) raises `InvalidFieldValueError` before the query reaches Tantivy. (2) A **backstop** wrapper around `index.parse_query` in `parse_user_query` (`_query.py`) maps residual Tantivy `ValueError` message prefixes (`Field does not exist:`, `Syntax Error:`, `Expected a valid integer:`) into the right subclass, so nothing leaks Rust internals or hits the generic 400. + +**Tech Stack:** Python 3.11+, Django, `tantivy` (tantivy-py 0.26.0), `regex`, stdlib `difflib`, pytest + pytest-django. All commands run via `uv run` from `src/`. + +**Spec:** `docs/superpowers/specs/2026-06-15-search-error-shapes-followup-design.md` (read it first). + +**Reference facts (empirically verified 2026-06-15):** + +- Tantivy `index.parse_query` raises `ValueError` with exactly these prefixes: `Field does not exist: ''`, `Syntax Error: `, `Expected a valid integer: 'ParseIntError { kind: InvalidDigit }'`. +- `page_count:>5`, `asn:<10`, `page_count:>=5`, `asn:[1 TO 10]`, `tag_id:1,2,3` parse OK (comparison operators produce correct `RangeQuery`). +- `asn:[1 TO]` / `asn:[TO 10]` are a **Syntax Error** (open numeric ranges unsupported; only open _date_ ranges work via sentinels). +- `scan()` only tokenizes fields in `KNOWN_FIELDS`; unknown `foobar:hello` stays a `Passthrough` and only fails at `parse_query` -> detected by the backstop, not proactively. +- `difflib.get_close_matches("corespondent", pool)` -> `["correspondent"]`; `has_tags`/`http`/`12` -> `[]` (bare message). +- `tantivy.Schema` exposes no field-name list, so the drift guard is parse-based. + +## File Structure + +| File | Responsibility | Change | +| ------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------ | --------------- | +| `src/documents/search/_translate.py` | Error classes, field-set constants, proactive numeric validation in `_render`, Tantivy-error mapper + hint helpers | Modify | +| `src/documents/search/_query.py` | Backstop wrapper around `index.parse_query` in `parse_user_query` | Modify | +| `src/documents/search/__init__.py` | Re-export new error classes for the view import | Modify (verify) | +| `src/documents/tests/search/test_error_shapes.py` | All unit tests for the new behavior (dedicated file per subject) | Create | +| `src/documents/tests/test_api_search.py` | One view-level 400 integration test (mirrors existing `test_search_added_invalid_date`) | Modify | + +**Test command convention:** single-file runs disable xdist: +`cd src && uv run pytest documents/tests/search/test_error_shapes.py --override-ini="addopts=" -v` + +--- + +### Task 1: Error classes and field-set constants + +**Files:** + +- Modify: `src/documents/search/_translate.py` (add `import difflib`; add constants and classes after the existing `InvalidDateQuery` class, around line 337) +- Test: `src/documents/tests/search/test_error_shapes.py` (create) + +- [ ] **Step 1: Write the failing test** + +Create `src/documents/tests/search/test_error_shapes.py`: + +```python +from __future__ import annotations + +import pytest + +from documents.search._translate import FIELD_ALIASES +from documents.search._translate import KNOWN_FIELDS +from documents.search._translate import NUMERIC_FIELDS +from documents.search._translate import SEARCHABLE_FIELDS +from documents.search._translate import InvalidFieldValueError +from documents.search._translate import MalformedQueryError +from documents.search._translate import SearchQueryError +from documents.search._translate import UnknownFieldError + + +@pytest.mark.search +class TestErrorClasses: + def test_all_subclass_search_query_error(self): + assert issubclass(UnknownFieldError, SearchQueryError) + assert issubclass(InvalidFieldValueError, SearchQueryError) + assert issubclass(MalformedQueryError, SearchQueryError) + + def test_unknown_field_message_without_suggestion(self): + err = UnknownFieldError("has_tags") + assert err.field == "has_tags" + assert err.suggestion is None + assert str(err) == "Unknown search field 'has_tags'." + + def test_unknown_field_message_with_suggestion(self): + err = UnknownFieldError("corespondent", suggestion="correspondent") + assert err.suggestion == "correspondent" + assert str(err) == ( + "Unknown search field 'corespondent'. Did you mean 'correspondent'?" + ) + + def test_invalid_field_value_message_with_field(self): + err = InvalidFieldValueError("asn", "notanumber") + assert err.field == "asn" + assert err.value == "notanumber" + assert str(err) == "Field 'asn' expects a number, got 'notanumber'." + + def test_invalid_field_value_generic_message(self): + err = InvalidFieldValueError() + assert "number" in str(err).lower() + assert "ParseIntError" not in str(err) + + def test_malformed_query_message(self): + err = MalformedQueryError("Unbalanced quote in search query.") + assert str(err) == "Unbalanced quote in search query." + + +@pytest.mark.search +class TestFieldSets: + def test_numeric_fields_are_known(self): + assert NUMERIC_FIELDS <= KNOWN_FIELDS + + def test_searchable_excludes_aliases(self): + assert SEARCHABLE_FIELDS == KNOWN_FIELDS - set(FIELD_ALIASES) + # aliases must NOT be suggestable + for alias in FIELD_ALIASES: + assert alias not in SEARCHABLE_FIELDS +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd src && uv run pytest documents/tests/search/test_error_shapes.py --override-ini="addopts=" -v` +Expected: FAIL with `ImportError: cannot import name 'NUMERIC_FIELDS'` (and the other new names). + +- [ ] **Step 3: Write minimal implementation** + +In `src/documents/search/_translate.py`, add `import difflib` to the stdlib import group (after line 2, before `from dataclasses import dataclass`): + +```python +import difflib +``` + +Then, immediately after the `InvalidDateQuery` class (after line 336), add: + +```python +class UnknownFieldError(SearchQueryError): + """Raised when a query scopes on a field name that does not exist.""" + + def __init__(self, field: str, suggestion: str | None = None) -> None: + self.field = field + self.suggestion = suggestion + message = f"Unknown search field {field!r}." + if suggestion: + message += f" Did you mean {suggestion!r}?" + super().__init__(message) + + +class InvalidFieldValueError(SearchQueryError): + """Raised when a numeric field receives a non-numeric value.""" + + def __init__(self, field: str | None = None, value: str | None = None) -> None: + self.field = field + self.value = value + if field is not None and value is not None: + message = f"Field {field!r} expects a number, got {value!r}." + else: + message = "A numeric field in the search query received a non-numeric value." + super().__init__(message) + + +class MalformedQueryError(SearchQueryError): + """Raised for structural syntax errors (unbalanced quotes/brackets, etc.).""" +``` + +Add the field-set constants next to `KNOWN_FIELDS` (after line 92, after the `KNOWN_FIELDS` definition): + +```python +# Numeric (unsigned-int) fields. Values must be integers, optionally prefixed by +# a comparison operator (>, <, >=, <=). Validated proactively in _render. +NUMERIC_FIELDS = frozenset( + { + "asn", + "page_count", + "num_notes", + "correspondent_id", + "document_type_id", + "storage_path_id", + "tag_id", + "owner_id", + "viewer_id", + }, +) + +# Canonical user-facing field names for validation and did-you-mean suggestions. +# Aliases are excluded so a typo is never "corrected" to a deprecated alias. +SEARCHABLE_FIELDS = KNOWN_FIELDS - frozenset(FIELD_ALIASES) +``` + +Note: `SEARCHABLE_FIELDS` references `FIELD_ALIASES`, which is defined above `KNOWN_FIELDS` (line 54), so this ordering is valid. + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd src && uv run pytest documents/tests/search/test_error_shapes.py --override-ini="addopts=" -v` +Expected: PASS (all `TestErrorClasses` and `TestFieldSets` cases green). + +- [ ] **Step 5: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_error_shapes.py +git commit -m "feat(search): add error-shape classes and field-set constants" +``` + +--- + +### Task 2: Proactive numeric-value validation in `translate_query` + +**Files:** + +- Modify: `src/documents/search/_translate.py` (add `_validate_numeric`; hook into `_render` at lines 484-503) +- Test: `src/documents/tests/search/test_error_shapes.py` + +- [ ] **Step 1: Write the failing test** + +Append to `src/documents/tests/search/test_error_shapes.py`: + +```python +from datetime import UTC + +from documents.search._translate import translate_query + + +@pytest.mark.search +class TestProactiveNumericValidation: + @pytest.mark.parametrize( + ("query", "field", "value"), + [ + ("asn:notanumber", "asn", "notanumber"), + ("num_notes:abc", "num_notes", "abc"), + ("page_count:[foo TO bar]", "page_count", "foo"), + ("tag_id:1,foo", "tag_id", "foo"), + ], + ) + def test_non_numeric_value_raises(self, query, field, value): + with pytest.raises(InvalidFieldValueError) as exc_info: + translate_query(query, UTC) + assert exc_info.value.field == field + assert exc_info.value.value == value + + @pytest.mark.parametrize( + "query", + [ + "asn:5", + "asn:>5", + "asn:<10", + "page_count:>=5", + "page_count:<=5", + "asn:[1 TO 10]", + "tag_id:1,2,3", + "viewer_id:1,2", + "asn:[1 TO]", # open numeric range: passes the integer check here + "asn:[TO 10]", + ], + ) + def test_valid_numeric_values_do_not_raise(self, query): + # Should not raise InvalidFieldValueError. (Open numeric ranges still fail + # later at parse_query as a Syntax Error -> MalformedQueryError, but NOT + # here in the value validator.) + translate_query(query, UTC) + + def test_alias_numeric_field_validated(self): + # type_id is a numeric alias -> document_type_id; must still validate. + with pytest.raises(InvalidFieldValueError): + translate_query("type_id:abc", UTC) +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestProactiveNumericValidation" --override-ini="addopts=" -v` +Expected: FAIL — `test_non_numeric_value_raises` cases do not raise (values currently pass through to Tantivy unvalidated). + +- [ ] **Step 3: Write minimal implementation** + +In `src/documents/search/_translate.py`, add a module-level regex near the other operator patterns (after line 510, near `_SPACED_OP_RE`): + +```python +# Leading comparison operator on a numeric value (asn:>5, page_count:<=10). +_COMPARISON_PREFIX_RE = regex.compile(r"^(>=|<=|>|<)") +``` + +Add the validator helper (place it just above `_render`, around line 483): + +```python +def _validate_numeric(field: str, value: str) -> None: + """Raise InvalidFieldValueError if a numeric-field value is not an integer. + + Strips a single leading comparison operator (>, <, >=, <=) and surrounding + quotes first so comparison queries pass. An empty value (open range bound) + is accepted here; an open numeric bracket-range still fails downstream at + parse_query as a Syntax Error, surfaced as MalformedQueryError. + """ + candidate = _COMPARISON_PREFIX_RE.sub("", value.strip().strip("\"'")).strip() + if candidate == "": + return + if not candidate.isdigit(): + raise InvalidFieldValueError(field, value) +``` + +Modify `_render` (lines 490-502) to validate numeric fields. Replace the `FieldValueList`, `FieldValue`, and `FieldRange` branches with: + +```python + if isinstance(tok, FieldValueList): + field = FIELD_ALIASES.get(tok.field, tok.field) + if field in NUMERIC_FIELDS: + for v in tok.values: + _validate_numeric(field, v) + return " AND ".join(f"{field}:{v}" for v in tok.values) + if isinstance(tok, FieldValue): + field = FIELD_ALIASES.get(tok.field, tok.field) + if field in DATE_FIELDS: + return translate_scalar(field, tok.value, tz) + if field in NUMERIC_FIELDS: + _validate_numeric(field, tok.value) + return f"{field}:{tok.value}" + if isinstance(tok, FieldRange): + field = FIELD_ALIASES.get(tok.field, tok.field) + if field in DATE_FIELDS: + return translate_range(field, tok.lo, tok.hi, tz) + if field in NUMERIC_FIELDS: + _validate_numeric(field, tok.lo) + _validate_numeric(field, tok.hi) + return f"{field}:{tok.open}{tok.lo} TO {tok.hi}{tok.close}" +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestProactiveNumericValidation" --override-ini="addopts=" -v` +Expected: PASS. + +- [ ] **Step 5: Run the full translate test file to check for regressions** + +Run: `cd src && uv run pytest documents/tests/search/test_translate.py --override-ini="addopts=" -q` +Expected: PASS (no existing translate behavior broken). + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_error_shapes.py +git commit -m "feat(search): proactively validate numeric field values" +``` + +--- + +### Task 3: Tantivy-error mapper and malformed-query hint + +**Files:** + +- Modify: `src/documents/search/_translate.py` (add `_suggest_field`, `_malformed_hint`, `map_tantivy_error`) +- Test: `src/documents/tests/search/test_error_shapes.py` + +- [ ] **Step 1: Write the failing test** + +Append to `src/documents/tests/search/test_error_shapes.py`: + +```python +from documents.search._translate import map_tantivy_error + + +@pytest.mark.search +class TestMapTantivyError: + def test_unknown_field_maps_with_suggestion(self): + exc = ValueError("Field does not exist: 'corespondent'") + mapped = map_tantivy_error(exc, "corespondent:foo") + assert isinstance(mapped, UnknownFieldError) + assert mapped.field == "corespondent" + assert mapped.suggestion == "correspondent" + + def test_unknown_field_maps_without_suggestion(self): + exc = ValueError("Field does not exist: 'has_tags'") + mapped = map_tantivy_error(exc, "has_tags:true") + assert isinstance(mapped, UnknownFieldError) + assert mapped.field == "has_tags" + assert mapped.suggestion is None + + def test_integer_error_maps_to_invalid_value(self): + exc = ValueError("Expected a valid integer: 'ParseIntError { kind: InvalidDigit }'") + mapped = map_tantivy_error(exc, "asn:x") + assert isinstance(mapped, InvalidFieldValueError) + assert "ParseIntError" not in str(mapped) + + @pytest.mark.parametrize( + ("raw", "fragment"), + [ + ('title:"abc', "quote"), + ("(invoice OR bill", "parenthes"), + ("created:[2020 TO 2021", "bracket"), + ("invoice AND", "AND/OR/NOT"), + ("OR invoice", "AND/OR/NOT"), + ], + ) + def test_syntax_error_maps_to_specific_hint(self, raw, fragment): + exc = ValueError(f"Syntax Error: {raw}") + mapped = map_tantivy_error(exc, raw) + assert isinstance(mapped, MalformedQueryError) + assert fragment.lower() in str(mapped).lower() + assert raw not in str(mapped) # never echo the raw query verbatim + + def test_balanced_open_numeric_range_gets_generic_hint(self): + # asn:[1 TO] is a Syntax Error but brackets ARE balanced: must NOT claim + # "unbalanced bracket". + exc = ValueError("Syntax Error: asn:[1 TO ]") + mapped = map_tantivy_error(exc, "asn:[1 TO]") + assert isinstance(mapped, MalformedQueryError) + assert "unbalanced" not in str(mapped).lower() + + def test_unrecognized_message_returns_none(self): + exc = ValueError("Some brand new tantivy error") + assert map_tantivy_error(exc, "whatever") is None +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestMapTantivyError" --override-ini="addopts=" -v` +Expected: FAIL with `ImportError: cannot import name 'map_tantivy_error'`. + +- [ ] **Step 3: Write minimal implementation** + +In `src/documents/search/_translate.py`, add near the other error helpers (after the `MalformedQueryError` class is fine; place all three together at the end of the error-class section): + +```python +_FIELD_MISSING_RE = regex.compile(r"^Field does not exist: '(?P[^']*)'") + +_GENERIC_MALFORMED = ( + "Could not parse the search query. Check for unbalanced quotes, brackets, " + "or parentheses, or a misplaced AND/OR/NOT operator." +) + + +def _suggest_field(field: str) -> str | None: + """Return the closest valid field name to ``field``, or None.""" + matches = difflib.get_close_matches(field, SEARCHABLE_FIELDS, n=1) + return matches[0] if matches else None + + +def _malformed_hint(raw_query: str) -> str: + """Best-effort specific hint for a structural error; generic fallback. + + Only claims a specific cause when it is structurally evident (unbalanced + delimiters or a clearly misplaced boolean operator); otherwise returns the + generic message so we never assert a wrong-but-confident cause. + """ + if raw_query.count('"') % 2 != 0: + return "Unbalanced quote in the search query." + if raw_query.count("(") != raw_query.count(")"): + return "Unbalanced parenthesis in the search query." + if ( + raw_query.count("[") != raw_query.count("]") + or raw_query.count("{") != raw_query.count("}") + ): + return "Unbalanced bracket in the search query." + upper = raw_query.strip().upper() + if upper.startswith(("AND ", "OR ")) or upper.endswith((" AND", " OR", " NOT")): + return "Misplaced AND/OR/NOT operator in the search query." + return _GENERIC_MALFORMED + + +def map_tantivy_error(exc: ValueError, raw_query: str) -> SearchQueryError | None: + """Map a tantivy parse_query ValueError to a user-safe SearchQueryError. + + Returns None when the message is not a recognised family, so the caller can + re-raise the original (preserving today's generic 400 for truly unknown + errors rather than inventing a misleading message). + """ + message = str(exc) + m = _FIELD_MISSING_RE.match(message) + if m is not None: + field = m.group("field") + return UnknownFieldError(field, _suggest_field(field)) + if message.startswith("Expected a valid integer"): + return InvalidFieldValueError() + if message.startswith("Syntax Error"): + return MalformedQueryError(_malformed_hint(raw_query)) + return None +``` + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestMapTantivyError" --override-ini="addopts=" -v` +Expected: PASS. + +- [ ] **Step 5: Commit** + +```bash +git add src/documents/search/_translate.py src/documents/tests/search/test_error_shapes.py +git commit -m "feat(search): map tantivy parse errors to user-safe messages" +``` + +--- + +### Task 4: Backstop wrapper wired into `parse_user_query` + +**Files:** + +- Modify: `src/documents/search/_query.py` (import `map_tantivy_error`; add `_parse_query_friendly`; use it at lines 231-235 and 253-259) +- Test: `src/documents/tests/search/test_error_shapes.py` + +- [ ] **Step 1: Write the failing test** + +Append to `src/documents/tests/search/test_error_shapes.py`: + +```python +import tantivy + +from documents.search._query import parse_user_query +from documents.search._translate import SearchQueryError as _SQE # noqa: F401 + + +@pytest.mark.search +class TestBackstopViaParseUserQuery: + """Uses the module-scope ``index`` fixture from conftest.py.""" + + def test_unknown_field_raises_unknown_field_error(self, index: tantivy.Index): + with pytest.raises(UnknownFieldError) as exc_info: + parse_user_query(index, "foobar:hello", UTC) + assert exc_info.value.field == "foobar" + + def test_unknown_field_suggestion(self, index: tantivy.Index): + with pytest.raises(UnknownFieldError) as exc_info: + parse_user_query(index, "corespondent:bob", UTC) + assert exc_info.value.suggestion == "correspondent" + + def test_legacy_backend_field_is_unknown(self, index: tantivy.Index): + with pytest.raises(UnknownFieldError) as exc_info: + parse_user_query(index, "has_tags:true", UTC) + assert exc_info.value.field == "has_tags" + + @pytest.mark.parametrize( + "query", + ["(invoice OR bill", "invoice AND", "OR invoice", 'title:"abc'], + ) + def test_syntax_error_raises_malformed(self, index: tantivy.Index, query): + with pytest.raises(MalformedQueryError): + parse_user_query(index, query, UTC) + + def test_open_numeric_range_is_malformed_not_unbalanced(self, index: tantivy.Index): + with pytest.raises(MalformedQueryError) as exc_info: + parse_user_query(index, "asn:[1 TO]", UTC) + assert "unbalanced" not in str(exc_info.value).lower() + + @pytest.mark.parametrize( + "query", + ["page_count:>5", "asn:<10", "page_count:>=5", "asn:[1 TO 10]", "tag_id:1,2,3"], + ) + def test_comparison_and_range_queries_succeed(self, index: tantivy.Index, query): + assert isinstance(parse_user_query(index, query, UTC), tantivy.Query) + + @pytest.mark.parametrize( + "query", + ["notes.user:alice", "custom_fields.name:invoice"], + ) + def test_dotted_json_subfields_not_flagged(self, index: tantivy.Index, query): + assert isinstance(parse_user_query(index, query, UTC), tantivy.Query) + + def test_numeric_mismatch_raises_invalid_value(self, index: tantivy.Index): + # Proactive pass fires inside translate_query before parse_query. + with pytest.raises(InvalidFieldValueError) as exc_info: + parse_user_query(index, "asn:notanumber", UTC) + assert exc_info.value.field == "asn" +``` + +- [ ] **Step 2: Run test to verify it fails** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestBackstopViaParseUserQuery" --override-ini="addopts=" -v` +Expected: FAIL — unknown-field/syntax cases currently raise the bare Tantivy `ValueError`, not the new subclasses (the `index.parse_query` calls are unwrapped). The numeric-mismatch and success cases may already pass. + +- [ ] **Step 3: Write minimal implementation** + +In `src/documents/search/_query.py`, add the import alongside the existing translate imports (after line 20): + +```python +from documents.search._translate import map_tantivy_error +``` + +Add a module-level helper (place it just above `parse_user_query`, before line 191): + +```python +def _parse_query_friendly( + index: tantivy.Index, + query_str: str, + raw_query: str, + default_fields: list[str], + **kwargs, +) -> tantivy.Query: + """Call index.parse_query, translating Tantivy ValueErrors into user-safe + SearchQueryError subclasses. Unrecognised errors are re-raised unchanged.""" + try: + return index.parse_query(query_str, default_fields, **kwargs) + except SearchQueryError: + raise + except ValueError as exc: + mapped = map_tantivy_error(exc, raw_query) + if mapped is not None: + raise mapped from exc + raise +``` + +In `parse_user_query`, replace the exact-query parse (lines 231-235): + +```python + exact = _parse_query_friendly( + index, + query_str, + raw_query, + DEFAULT_SEARCH_FIELDS, + field_boosts=_FIELD_BOOSTS, + ) +``` + +and the fuzzy parse (lines 253-259): + +```python + fuzzy = _parse_query_friendly( + index, + query_str, + raw_query, + DEFAULT_SEARCH_FIELDS, + field_boosts=_FIELD_BOOSTS, + # (prefix=True, distance=1, transposition_cost_one=True) — edit-distance fuzziness + fuzzy_fields={f: (True, 1, True) for f in DEFAULT_SEARCH_FIELDS}, + ) +``` + +(`SearchQueryError` is already imported in `_query.py` at line 19.) + +- [ ] **Step 4: Run test to verify it passes** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestBackstopViaParseUserQuery" --override-ini="addopts=" -v` +Expected: PASS. + +- [ ] **Step 5: Run the full query test file for regressions** + +Run: `cd src && uv run pytest documents/tests/search/test_query.py --override-ini="addopts=" -q` +Expected: PASS (existing `parse_user_query` behavior, including `InvalidDateQuery` propagation, intact). + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/_query.py src/documents/tests/search/test_error_shapes.py +git commit -m "feat(search): wrap parse_query to surface friendly error shapes" +``` + +--- + +### Task 5: Guard tests (pin prefixes + drift) and view-level 400 + +**Files:** + +- Modify: `src/documents/search/__init__.py` (verify the new error classes are exported; add if missing) +- Test: `src/documents/tests/search/test_error_shapes.py` (pin + drift guards) +- Test: `src/documents/tests/test_api_search.py` (one view-level integration test) + +- [ ] **Step 1: Verify the search package exports the new classes** + +Run: `cd src && rg -n "SearchQueryError|InvalidDateQuery|__all__" documents/search/__init__.py` + +If `SearchQueryError` is re-exported there (the view imports `from documents.search import SearchQueryError`), add the three new classes the same way. Example edit — add to the existing `from documents.search._translate import ...` block and to `__all__` if present: + +```python +from documents.search._translate import InvalidFieldValueError +from documents.search._translate import MalformedQueryError +from documents.search._translate import UnknownFieldError +``` + +(The subclasses route through the existing `except SearchQueryError` handler regardless, so exporting is for discoverability/consumers. Skip if the package does not re-export error classes.) + +- [ ] **Step 2: Write the failing pin + drift guard tests** + +Append to `src/documents/tests/search/test_error_shapes.py`: + +```python +from documents.search._query import DEFAULT_SEARCH_FIELDS +from documents.search._query import _FIELD_BOOSTS +from documents.search._translate import SEARCHABLE_FIELDS as _SEARCHABLE + + +@pytest.mark.search +class TestTantivyPinnedPrefixes: + """If a tantivy-py upgrade changes these prefixes, the backstop silently + regresses to the generic 400. Pin them so the upgrade fails loudly.""" + + def _err(self, index: tantivy.Index, raw: str) -> str: + with pytest.raises(ValueError) as exc_info: + index.parse_query(raw, DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS) + return str(exc_info.value) + + def test_unknown_field_prefix(self, index: tantivy.Index): + assert self._err(index, "foobar:hello").startswith("Field does not exist:") + + def test_syntax_error_prefix(self, index: tantivy.Index): + assert self._err(index, "(invoice OR bill").startswith("Syntax Error") + + def test_integer_error_prefix(self, index: tantivy.Index): + assert self._err(index, "asn:notanumber").startswith("Expected a valid integer") + + +@pytest.mark.search +class TestFieldDriftGuard: + """Every user-facing searchable field must be a real schema field. tantivy + exposes no field-name list, so we assert via parse: a real field never raises + 'Field does not exist'.""" + + @pytest.mark.parametrize("field", sorted(_SEARCHABLE)) + def test_searchable_field_exists_in_schema(self, index: tantivy.Index, field): + try: + index.parse_query( + f"{field}:1", + DEFAULT_SEARCH_FIELDS, + field_boosts=_FIELD_BOOSTS, + ) + except ValueError as exc: + # A type/syntax error proves the field EXISTS; only "does not exist" + # is a drift failure. + assert "Field does not exist" not in str(exc), ( + f"{field!r} is in SEARCHABLE_FIELDS but missing from the schema" + ) +``` + +- [ ] **Step 3: Run the guard tests to verify they pass** + +Run: `cd src && uv run pytest "documents/tests/search/test_error_shapes.py::TestTantivyPinnedPrefixes" "documents/tests/search/test_error_shapes.py::TestFieldDriftGuard" --override-ini="addopts=" -v` +Expected: PASS. (These assert current truth; they guard against future drift. If `TestFieldDriftGuard` fails now, `SEARCHABLE_FIELDS` lists a name not in the schema — fix `KNOWN_FIELDS`/`NUMERIC_FIELDS`, not the test.) + +- [ ] **Step 4: Write the failing view-level test** + +In `src/documents/tests/test_api_search.py`, locate `test_search_added_invalid_date` (around line 723) and add this test directly after it, inside the same `TestDocumentSearchApi` class (mirrors that test's structure): + +```python + def test_search_unknown_field_returns_400(self) -> None: + """ + GIVEN: + - A query scoping on a non-existent field + WHEN: + - The search API is called + THEN: + - HTTP 400 with the unknown-field message under the "query" key + """ + response = self.client.get("/api/documents/?query=foobar:hello") + self.assertEqual(response.status_code, status.HTTP_400_BAD_REQUEST) + self.assertIn("foobar", str(response.data["query"])) +``` + +- [ ] **Step 5: Run the view-level test to verify it passes** + +Run: `cd src && uv run pytest "documents/tests/test_api_search.py::TestDocumentSearchApi::test_search_unknown_field_returns_400" --override-ini="addopts=" -v` +Expected: PASS (the existing `except SearchQueryError` handler converts `UnknownFieldError` to `ValidationError({"query": [...]})`). + +- [ ] **Step 6: Commit** + +```bash +git add src/documents/search/__init__.py src/documents/tests/search/test_error_shapes.py src/documents/tests/test_api_search.py +git commit -m "test(search): pin tantivy error prefixes, guard field drift, view 400" +``` + +--- + +### Task 6: Full suite + lint + +**Files:** none (verification only) + +- [ ] **Step 1: Run the whole search test directory** + +Run: `cd src && uv run pytest documents/tests/search/ --override-ini="addopts=" -q` +Expected: PASS. + +- [ ] **Step 2: Run the API search tests** + +Run: `cd src && uv run pytest documents/tests/test_api_search.py --override-ini="addopts=" -q` +Expected: PASS. + +- [ ] **Step 3: Lint the changed files** + +Run: `cd src && uv run ruff check documents/search/_translate.py documents/search/_query.py documents/tests/search/test_error_shapes.py` +Expected: no errors (fix any import-ordering/formatting issues ruff reports; run `uv run ruff format` on the same files if needed). + +- [ ] **Step 4: Final commit (only if lint produced changes)** + +```bash +git add -A +git commit -m "chore(search): lint error-shapes follow-up" +``` + +--- + +## Self-Review + +**Spec coverage:** + +- `UnknownFieldError` (+ did-you-mean, legacy backend fields as unknown) -> Tasks 1, 3, 4. +- `InvalidFieldValueError` (proactive + backstop) -> Tasks 1, 2, 4. +- `MalformedQueryError` (balance-check, no verbatim echo, open-range caveat) -> Tasks 1, 3, 4. +- Hybrid detection (proactive scanner + backstop wrapper) -> Tasks 2, 4. +- `>`/`<` left working + validator allows operators -> Task 2 (`test_valid_numeric_values_do_not_raise`), Task 4 (`test_comparison_and_range_queries_succeed`). +- Single source of truth + drift guard -> Task 1 (`SEARCHABLE_FIELDS`), Task 5 (`TestFieldDriftGuard`). +- Message-prefix pin test -> Task 5 (`TestTantivyPinnedPrefixes`). +- Dotted-JSON / open-numeric-range / view-400 -> Tasks 4 and 5. +- Out of scope (frontend, URL search) -> correctly untouched. + +**Placeholder scan:** none — every code step shows full code and exact commands. + +**Type/name consistency:** `UnknownFieldError(field, suggestion=)`, `InvalidFieldValueError(field=None, value=None)`, `MalformedQueryError(message)`, `NUMERIC_FIELDS`, `SEARCHABLE_FIELDS`, `_validate_numeric(field, value)`, `_suggest_field(field)`, `_malformed_hint(raw_query)`, `map_tantivy_error(exc, raw_query)`, `_parse_query_friendly(index, query_str, raw_query, default_fields, **kwargs)` are used identically across all tasks. diff --git a/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-design.md b/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-design.md new file mode 100644 index 000000000..3cf0701d6 --- /dev/null +++ b/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-design.md @@ -0,0 +1,262 @@ +# Friendlier advanced-search error shapes + +**Status:** design / ready for implementation plan. +**Follow-up to:** the `InvalidDateQuery` work on branch `fix/search-query-translation` +(PR #13010), itself specced in +`docs/superpowers/done/specs/2026-06-14-search-query-translation-design.md`. +**Builds on:** the `SearchQueryError(ValueError)` base in +`documents/search/_translate.py` and the single `except SearchQueryError` handler +in `UnifiedSearchViewSet.list` (`documents/views.py:2477`), which re-raises as DRF +`ValidationError({"query": [msg]})`. Any new subclass surfaces through that one +handler automatically, so this work is purely additive. + +## Problem + +Every advanced-search failure other than the now-handled invalid date lands in +the view's generic `except Exception` and returns +`HttpResponseBadRequest("Error listing search results, check logs for more +detail.")` (`views.py:2479-2482`). `index.parse_query(...)` runs _outside_ the +`translate_query` try/except in `parse_user_query` (`_query.py:220-235`), so +anything Tantivy rejects bypasses `SearchQueryError` entirely and gets the +unhelpful generic 400. Some Tantivy errors also leak Rust internals (e.g. +`ParseIntError { kind: InvalidDigit }`) if surfaced verbatim. + +## Ground truth: what Tantivy raises (empirically re-verified 2026-06-15) + +Probed against a real index built from `documents.search._schema.build_schema` + +`_tokenizer.register_tokenizers`, running each query through `translate_query` +then `index.parse_query(..., DEFAULT_SEARCH_FIELDS, field_boosts=_FIELD_BOOSTS)`. +`index.parse_query` raises `ValueError` with three distinguishable message +families: + +| Family | Example inputs | Tantivy message | +| ---------------- | ---------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ | +| Unknown field | `foobar:hello`, `owner:5`, `has_tags:true`, `is_shared:true` | `Field does not exist: 'foobar'` | +| Syntax error | `(invoice OR bill`, `created:[2020 TO 2021`, `invoice AND`, `OR invoice`, `title:"abc`, `:value`, `AND OR` | `Syntax Error: ` | +| Wrong value type | `asn:notanumber`, `page_count:[foo TO bar]` | `Expected a valid integer: 'ParseIntError { kind: InvalidDigit }'` | + +**Parses OK (NOT errors):** `page_count:>5`, `asn:<10`, `page_count:>=5` (these +produce _correct_ Tantivy `RangeQuery` objects — see the `>`/`<` decision below), +`page_count:5` (TermQuery), `title:[a TO b]` (Str range on a text field), +`title:~~~` (EmptyQuery), `` (empty query). + +## The `>`/`<`/`>=`/`<=` decision (investigated) + +The stub flagged these as "parses OK but possibly silently wrong vs Whoosh." That +is **incorrect** — verified empirically: + +- `page_count:>5` -> `RangeQuery { lower_bound: Excluded(5), upper_bound: Unbounded }` +- `page_count:>=5` -> `RangeQuery { lower_bound: Included(5), ... }` +- `asn:<10` -> `RangeQuery { lower_bound: Unbounded, upper_bound: Excluded(10) }` + +Tantivy's query parser supports comparison operators natively and produces +correct range queries. They were _never_ supported in paperless-Whoosh (no +`GtLtPlugin`; confirmed in the base design §3 and against the old schema on +`main`), so there is no prior behavior to "match" and nothing is silently wrong. + +**Decision: leave them as-is.** They work correctly and are effectively a free +capability gain over Whoosh. Do not reject them. The numeric value validator +(below) must explicitly _allow_ a leading `>`, `<`, `>=`, or `<=` so it does not +wrongly reject working comparison queries. Document them as supported. + +Note the asymmetry (verified): the _comparison_ forms work, but _open +bracket-ranges_ on numeric fields do NOT. `asn:[1 TO]` and `asn:[TO 10]` +translate verbatim and Tantivy raises `Syntax Error` on them — only open _date_ +ranges are rewritten to bounded sentinels (`OPEN_LO`/`OPEN_HI`) by +`translate_range`. So `asn:>1` works but `asn:[1 TO]` is malformed; readers must +not assume bracket-ranges behave like the comparison operators. + +## Clarification: the old "Whoosh-only" fields were backend internals + +The stub proposed targeted "this field isn't available in full-text search; use +the filter sidebar" messages for `owner`, `has_tags`, `is_shared`, etc. Checking +the old Whoosh code still present on `main` (`src/documents/index.py`, +`make_schema`) shows these were **backend filter / permission fields, not +user-facing search syntax**: + +- `has_correspondent`, `has_tag`, `has_type`, `has_path`, `has_owner`, + `has_custom_fields`, `is_shared`, `custom_field_count`, and `owner` (username + text) were populated at index time so the permission and list-filter machinery + could query them. `get_permissions_criterias()` builds + `query.Term("has_owner", ...)`, `query.Term("owner_id", ...)`, + `query.Term("viewer_id", ...)` programmatically. +- The user-facing full-text parser (`DelayedFullTextQuery`) advertised only + `["content", "title", "correspondent", "tag", "type", "notes", +"custom_fields"]`. (Whoosh's generic `FieldsPlugin` would technically parse a + typed `has_tag:true`, but that was never a designed or documented feature.) + +**Consequence:** no curated per-field messages. These names are simply _unknown +fields_ and flow through `UnknownFieldError` + did-you-mean like any typo. Note +that `owner_id` and `viewer_id` legitimately remain queryable (present in +`KNOWN_FIELDS` and the new schema, still used for permission filtering), so they +are correctly _not_ unknown; `owner`, the `has_*` booleans, `is_shared`, and +`custom_field_count` are correctly absent and read as plain unknown fields. + +## Proposed error shapes (all `SearchQueryError` subclasses) + +All three live next to `InvalidDateQuery` in `_translate.py` and inherit the +"message is safe to surface" contract. + +### 1. `UnknownFieldError` + +`Unknown search field 'corespondent'.` plus a `Did you mean 'correspondent'?` +suggestion via `difflib.get_close_matches(field, suggestion_pool, n=1)`. The +suggestion pool is the user-facing field set derived from `KNOWN_FIELDS` (see +single-source-of-truth below). Typos get a suggestion; names with no close match +(e.g. `has_tags`) get the bare unknown-field message. + +### 2. `InvalidFieldValueError` + +Sibling of `InvalidDateQuery`. For numeric fields: `Field 'asn' expects a number, +got 'notanumber'.` Carries `field` + `value` attributes like `InvalidDateQuery`. +Removes the `ParseIntError { ... }` Rust leak. + +### 3. `MalformedQueryError` + +Structural syntax errors. A cheap balance-check pass gives specific hints for the +common cases (unbalanced `"`, `[`, `(`, dangling/leading `AND`/`OR`) before +falling back to a clean generic "check for unbalanced quotes, brackets, or +parentheses." Tantivy's message has no location and echoes the whole query, so it +is **never** surfaced verbatim. + +Caveat: an open numeric bracket-range (`asn:[1 TO]`) reaches this path as a +`Syntax Error` even though its brackets are _balanced_. The balance-check must not +confidently assert "unbalanced brackets" for it — when brackets/quotes/parens are +balanced, fall back to the generic hint rather than a wrong-but-specific one. + +## Detection strategy: hybrid, split by what each layer can see + +The scanner and the parse-wrapper see different things; assigning each error +shape to the layer that can detect it cleanly avoids false positives. + +### Proactive numeric validation in the scanner (`_translate.py`) + +`scan()` already tokenizes recognized `field:value` and `field:[range]` clauses +into `FieldValue` / `FieldRange` for fields in `KNOWN_FIELDS`. Add a +`NUMERIC_FIELDS` set (`asn`, `page_count`, `num_notes`, and the `*_id` fields) +and validate those tokens' values during translation, raising +`InvalidFieldValueError` before the string ever reaches `index.parse_query`. + +Validation rules: + +- Strip a single leading comparison operator (`>=`, `<=`, `>`, `<`) before the + integer check, so comparison queries pass. +- For ranges, validate each present bound (`lo`, `hi`) as an integer. An empty + bound passes the _integer_ check (an empty string is not a bad integer), but be + aware that an open numeric bracket-range (`asn:[1 TO]`) is still rejected + downstream by Tantivy as a `Syntax Error` (see the `>`/`<` note above) — i.e. + the validator does not make it succeed, it falls through to the malformed path. + Do not emit a "bad number" message for an empty bound; let the structural error + surface as `MalformedQueryError`. +- For multi-value numeric fields after comma expansion (`tag_id`, `viewer_id`), + validate each expanded value (`tag_id:1,foo` -> `InvalidFieldValueError`). + +This path owns `InvalidFieldValueError` exclusively: messages are rich, +context-aware, and independent of Tantivy's English strings. + +### Why unknown-field detection is NOT proactive + +`_match_field_token` returns `None` for any field not in `KNOWN_FIELDS` +(`_translate.py:193`) — an unknown `foobar:hello` is intentionally left as a +`Passthrough` (the existing `http:`-misfire guard). Detecting unknown fields in +the scanner would require a separate `\w+:` pass that re-introduces exactly the +false positives that guard exists to prevent: + +- URLs: `http://example.com/a` (`http:`) +- Dotted JSON subfields, which are valid: `notes.user:alice`, + `custom_fields.invoice_no:123` (the `\w+:` regex would see `user:` / `invoice_no:`) +- Time-like literals: `12:30` (note: a bare `12:30` is already a parse failure + today — Tantivy raises `Field does not exist: '12'` — so under this design it + reshapes into `UnknownFieldError("12")` with no close match, i.e. the bare + unknown-field message. It is _not_ a clean passthrough; the point here is only + that a proactive `\w+:` pass would mis-flag it even more aggressively.) + +So unknown fields are detected in the backstop instead, where Tantivy has already +confirmed the token is a real field reference. + +### Catch-and-sanitize backstop around `index.parse_query` (`_query.py`) + +Wrap the `index.parse_query` call(s) in `parse_user_query`. Map residual Tantivy +`ValueError` messages by prefix: + +- `Field does not exist: 'X'` -> extract `X`, build `UnknownFieldError(X)` with + did-you-mean. +- `Syntax Error: ...` -> `MalformedQueryError` (run the balance-check for a + specific hint; never echo the Tantivy text). +- `Expected a valid integer: ...` -> `InvalidFieldValueError`. This backstop is + effectively _unreachable today_: every query that produces this Tantivy error + goes through a recognized numeric field token (`asn`, `page_count`, `num_notes`, + `*_id`) that `scan()` already models, so the proactive pass catches it first + (verified — no query reaches this branch without the proactive pass firing). + Its real value is forward-safety: if a future numeric field is added to the + schema but not to `NUMERIC_FIELDS`, this branch guarantees the raw + `ParseIntError { ... }` Rust struct never leaks. Keep it; the generic "expects a + number" message stands in when `field`/`value` cannot be recovered from the + Tantivy text. +- Anything unrecognized -> re-raise, preserving today's generic-500/400 path + rather than inventing a misleading message. + +Both the fuzzy and exact `parse_query` calls go through the same wrapper. + +### Single source of truth for fields + +`KNOWN_FIELDS` (`_translate.py:63`) is the canonical set of field names a user may +validly scope on; it already includes the v2 aliases (`type`, `path`, `type_id`, +`path_id`) that `translate_query` rewrites to real schema names. Use it (with +aliases optionally excluded from the _suggestion_ pool to avoid suggesting a +deprecated alias) for both validation and did-you-mean. + +Add a drift-guard test asserting `KNOWN_FIELDS` minus the alias set is a subset of +the schema field names produced by `build_schema()`, so the two definitions cannot +silently diverge as the schema evolves. The backend-only Whoosh names (`owner`, +`has_*`, `is_shared`, `custom_field_count`) are correctly excluded from both. + +## Testing + +New dedicated test file (per project convention), e.g. +`src/documents/tests/test_search_error_shapes.py`: + +- One case per error family asserting the `SearchQueryError` subclass and a + user-safe message (no paths, no Rust structs, no verbatim Tantivy echo). +- `UnknownFieldError`: typo yields a did-you-mean suggestion; a no-close-match + name (e.g. `has_tags`) yields the bare message. +- `InvalidFieldValueError`: `asn:notanumber`, `page_count:[foo TO bar]`, and a + bad multi-value `tag_id:1,foo`. +- **`>`/`<` working case**: `page_count:>5` / `asn:<10` / `page_count:>=5` parse + successfully and are NOT raised as errors (guards the numeric validator's + operator allowance). +- **Open numeric range**: `asn:[1 TO]` / `asn:[TO 10]` surface as + `MalformedQueryError` (Tantivy `Syntax Error`, brackets balanced) and the hint + is the generic one, NOT a false "unbalanced brackets" claim. +- **Dotted JSON non-regression**: `notes.user:alice`, + `custom_fields.name:invoice` are not flagged as unknown fields. +- **URL behavior**: a query containing `http://...` is unchanged from today's + behavior — Tantivy treats `http` as a field, so it still 400s; under this design + the message becomes the clearer `UnknownFieldError('http')` (no close match) and + the proactive numeric pass does not touch it. This is a clarity gain, not a + regression (it already 400'd generically). Out of scope to make URL substrings + searchable. +- **Message-prefix pin**: a test that asserts the exact Tantivy prefixes + (`Field does not exist:`, `Syntax Error:`, `Expected a valid integer:`) the + backstop depends on, so a `tantivy-py` upgrade that changes them fails loudly + instead of silently regressing to the generic 400. +- **Drift guard**: `KNOWN_FIELDS` (minus aliases) ⊆ `build_schema()` field names. +- View-level: each subclass surfaces as HTTP 400 with `{"query": [msg]}` through + the existing handler. + +## Risks / notes + +- The backstop depends on Tantivy's error _message-string prefixes_, which are + brittle across `tantivy-py` upgrades. The pin test above is the mitigation. +- Keep all messages safe to surface: they may echo user input but must never + include internal paths, stack details, or Rust error structs. +- The balance-check for `MalformedQueryError` is a heuristic for _hints_ only; its + failure mode is the clean generic message, never a wrong-but-confident one. + +## Out of scope + +- Frontend rendering of the structured `{"query": [...]}` 400 (the inline search + error UI). Only relevant if the messages should render differently from the + current generic banner; the current banner already displays the message. +- Adding or changing `>`/`<` semantics. They work; this spec only ensures the + numeric validator does not break them. diff --git a/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-stub.md b/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-stub.md deleted file mode 100644 index 34f3d1aeb..000000000 --- a/docs/superpowers/specs/2026-06-15-search-error-shapes-followup-stub.md +++ /dev/null @@ -1,99 +0,0 @@ -# Spec stub: friendlier advanced-search error shapes (follow-up) - -**Status:** stub / not yet planned. Follow-up to the `InvalidDateQuery` work on -branch `fix/search-query-translation` (PR #13010). -**Builds on:** the `SearchQueryError(ValueError)` base added in -`documents/search/_translate.py` and the single `except SearchQueryError` handler -in `UnifiedSearchViewSet.list` (`documents/views.py`). Any new subclass surfaces -through that one handler automatically, so this work is purely additive. - -## Problem - -Every failure on the advanced-search path (other than the now-handled invalid -date) lands in the view's generic `except Exception` and returns -`HttpResponseBadRequest("Error listing search results, check logs for more -detail.")`. `index.parse_query(...)` runs _outside_ the `translate_query` -try/except in `parse_user_query`, so anything Tantivy rejects bypasses -`SearchQueryError` entirely and gets the unhelpful generic 400. Some Tantivy -errors also leak Rust internals if surfaced verbatim. - -## Ground truth: what Tantivy raises (empirically probed 2026-06-15) - -`index.parse_query` raises `ValueError` with three distinguishable message -families: - -| Family | Example inputs | Tantivy message | -| ---------------- | ---------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------ | -| Unknown field | `foobar:hello`, `owner:5`, `has_tags:true`, `is_shared:true` | `Field does not exist: 'foobar'` | -| Syntax error | `(invoice OR bill`, `created:[2020 TO 2021`, `invoice AND`, `OR invoice`, `title:"abc`, `:value`, `AND OR` | `Syntax Error: ` | -| Wrong value type | `asn:notanumber`, `page_count:[foo TO bar]` | `Expected a valid integer: 'ParseIntError { kind: InvalidDigit }'` | - -Parses OK (NOT errors, but possibly silently wrong — out of scope for error -shapes, noted for awareness): `page_count:>5`, `asn:<10` (the `>`/`<` -comparisons), `title:[a TO b]` (range on a text field), `title:~~~` (bad fuzzy), -`` (empty query). - -## Proposed error shapes (all `SearchQueryError` subclasses) - -### 1. `UnknownFieldError` (highest value) - -Distinguish two sub-cases: - -- **Typo** -> `Unknown search field 'corespondent'.` plus a "Did you mean - 'correspondent'?" suggestion via `difflib.get_close_matches` against the known - schema field set. -- **Whoosh-only field with no Tantivy equivalent** (finite known set: `owner`, - `has_tags`, `has_correspondent`/other `has_*`, `is_shared`, custom-field - id/count) -> targeted message, e.g. "`has_tags` isn't available in full-text - search; use the filter sidebar," rather than "field does not exist." - -The list of valid fields already exists implicitly: schema fields in -`documents/search/_schema.py` and `KNOWN_FIELDS` in `_translate.py`. Reconcile to -a single source of truth for validation + suggestions. - -### 2. `InvalidFieldValueError` (sibling of `InvalidDateQuery`) - -For numeric fields (`asn`, `page_count`, `num_notes`, `*_id`): "Field 'asn' -expects a number, got 'notanumber'." Carries `field` + `value` like -`InvalidDateQuery`. Also removes the `ParseIntError { ... }` Rust leak. - -### 3. `MalformedQueryError` (structural syntax) - -A cheap balance-check pass gives specific hints for the common cases (unbalanced -`"`, `[`, `(`, dangling `AND`/`OR`) before falling back to a clean generic -"check for unbalanced quotes, brackets, or parentheses." Tantivy's message has no -location, so do not echo it verbatim. - -## Detection strategy: hybrid - -- **Proactive validation in the scanner** (`_translate.py`): it already tokenizes - `field:value` / `field:[range]`, so extend it to validate field existence (+ - suggestions) and numeric value types up front. Messages are rich, - context-aware, and independent of Tantivy's English error strings. Lives next - to `InvalidDateQuery`. Covers only the field-scoped tokens it recognizes. -- **Catch-and-sanitize wrapper** around `index.parse_query` in - `parse_user_query`: map residual `Syntax Error:` / type-mismatch messages into - `MalformedQueryError` / `InvalidFieldValueError` so nothing leaks internals or - the generic message. Backstop for pure structural errors the scanner does not - model. - -Both paths raise `SearchQueryError`; the existing view handler routes them all. - -## Risks / notes - -- The catch-and-sanitize wrapper depends on Tantivy's error _message string - prefixes_ (`Field does not exist:`, `Syntax Error:`, `Expected a valid -integer:`), which are brittle across `tantivy-py` upgrades. Add a test that - pins those prefixes so an upgrade that changes them fails loudly rather than - silently regressing to the generic 400. -- Decide separately whether `>`/`<` comparisons should be supported, rejected - with a message, or left as-is. They currently parse without error and are - likely silently wrong relative to Whoosh semantics. -- Keep messages safe to surface: they may echo user input but must not include - internal paths, stack details, or Rust error structs. - -## Out of scope - -Frontend rendering of the structured `{"query": [...]}` 400 (the inline search -error UI) — only relevant if the messages should render differently from the -current generic banner.