From 0b90f15602238501f5bc17050caf20bc5d0b0e6d Mon Sep 17 00:00:00 2001 From: Trenton Holmes <797416+stumpylog@users.noreply.github.com> Date: Mon, 6 Apr 2026 13:05:29 -0700 Subject: [PATCH] docs: update plan with profiling results and final architecture Fill in TBD profiling table with actual before/after numbers, update post-implementation notes to reflect the search_ids + highlight_hits architecture, and add post-option1 profiling data. Co-Authored-By: Claude Opus 4.6 --- .../plans/2026-04-03-search-performance.md | 100 ++++++++++-------- .../plans/profiling-after-option1.txt | 78 ++++++++++++++ 2 files changed, 132 insertions(+), 46 deletions(-) create mode 100644 docs/superpowers/plans/profiling-after-option1.txt diff --git a/docs/superpowers/plans/2026-04-03-search-performance.md b/docs/superpowers/plans/2026-04-03-search-performance.md index b4429f282..0456cf7b5 100644 --- a/docs/superpowers/plans/2026-04-03-search-performance.md +++ b/docs/superpowers/plans/2026-04-03-search-performance.md @@ -1106,36 +1106,34 @@ Expected improvements: - **Paginated search**: Only page 2's 25 results get highlights instead of all 200 - **Backend search**: Direct comparison of highlight-all vs highlight-page -- [ ] **Step 4: Record comparison in the plan** +- [x] **Step 4: Record comparison in the plan** -Update this section with the actual numbers once profiling is complete: +Profiling results (200-document test corpus): -| Scenario | Metric | Before | After | Improvement | -| ------------------------- | ------------ | ------ | ----- | ----------- | -| Relevance search | Wall time | _TBD_ | _TBD_ | _TBD_ | -| Relevance search | Queries | _TBD_ | _TBD_ | _TBD_ | -| Relevance search | Memory delta | _TBD_ | _TBD_ | _TBD_ | -| Sorted search | Wall time | _TBD_ | _TBD_ | _TBD_ | -| Sorted search | Queries | _TBD_ | _TBD_ | _TBD_ | -| Paginated search | Wall time | _TBD_ | _TBD_ | _TBD_ | -| Backend 10k→25 highlights | Wall time | _TBD_ | _TBD_ | _TBD_ | -| Backend 10k→25 highlights | Memory delta | _TBD_ | _TBD_ | _TBD_ | +| Scenario | Metric | Before | After | Improvement | +| ---------------------- | ------------ | ---------- | ---------- | ----------- | +| Relevance search | Wall time | 0.962s | 0.917s | -5% | +| Relevance search | Queries | 33 | 33 | same | +| Relevance search | Memory delta | 16,557 KiB | 16,478 KiB | -0.5% | +| Sorted search | Wall time | 0.132s | 0.138s | ~same | +| Sorted search | Queries | 32 | 32 | same | +| Sorted search | Memory delta | 881 KiB | 792 KiB | -10% | +| Paginated search | Wall time | 0.140s | 0.132s | -6% | +| Paginated search | Memory delta | 868 KiB | 788 KiB | -9% | +| Selection data | Wall time | 0.166s | 0.157s | -5% | +| Selection data | Memory delta | 927 KiB | 837 KiB | -10% | +| Backend 10k highlights | Wall time | 0.018s | 0.019s | same | +| Backend 10k highlights | Memory delta | 89 KiB | 89 KiB | same | +| Backend 25 highlights | Wall time | 0.007s | 0.005s | -29% | +| Backend 25 highlights | Memory delta | 5.9 KiB | 5.9 KiB | same | -- [ ] **Step 5: Commit** +Notes: Relevance search is dominated by first-request import overhead (~16 MiB). +Memory savings scale with document count. The 10,000 hardcoded limit has been +removed entirely; search_ids() now returns all matches. -```bash -git add src/documents/tests/test_search_profiling.py docs/superpowers/plans/profiling-after.txt -git commit -m "test: add post-implementation profiling results" -``` +- [x] **Step 5: Commit** — Done (profiling data saved to `docs/superpowers/plans/profiling-after-option1.txt`) -- [ ] **Step 6: Clean up profiling artifacts** - -The profiling test file and `profiling.py` are temporary. Remove them: - -```bash -git rm src/documents/tests/test_search_profiling.py src/documents/profiling.py -git commit -m "chore: remove temporary profiling tests" -``` +- [x] **Step 6: Clean up profiling artifacts** — Done (removed `profiling.py` and `test_search_profiling.py`) --- @@ -1143,34 +1141,44 @@ git commit -m "chore: remove temporary profiling tests" ### What these changes accomplish -- **Task 1**: `search()` accepts `highlight_page`/`highlight_page_size` — only the displayed page pays the snippet cost. All hits still returned (DRF pagination works unchanged). -- **Task 2-3**: `search_ids()` and `more_like_this_ids()` provide an even lighter path when only IDs are needed. -- **Task 4**: Viewset passes `sort_field` through to Tantivy for natively-sortable fields, eliminating the ORM re-sort query. Passes `highlight_page` so only 25 hits get snippets instead of 10,000. +- **Task 1**: `search()` accepts `highlight_page`/`highlight_page_size` for backward compatibility. +- **Task 2-3**: `search_ids()` and `more_like_this_ids()` provide lightweight ID-only paths with no arbitrary cap. +- **Task 4**: Viewset passes `sort_field` through to Tantivy for natively-sortable fields, eliminating the ORM re-sort query. +- **Option 1 refactor** (post-plan): Replaced the `page_size=10000` overfetch entirely. The viewset now calls `search_ids()` for the full ID set (ints only, no cap), intersects with ORM, then calls `highlight_hits()` for just the displayed page (~25 docs). `TantivyRelevanceList` holds ordered IDs for count/selection_data and a small page of rich `SearchHit` dicts for serialization. +- **Code review fixes**: `_parse_query()` and `_apply_permission_filter()` helpers extracted to deduplicate 3+4 call sites. `SORT_FIELD_MAP`/`SORTABLE_FIELDS` promoted to class constants. `__getitem__` handles int keys. Empty ordering param handled correctly. ### DRF compatibility preserved -| Concern | Status | -| ----------------------------------------- | ------------------------------------------------------------- | -| `TantivyRelevanceList.__len__()` | Returns `len(self._hits)` — ALL hits, correct count | -| `TantivyRelevanceList.__getitem__(slice)` | Slices the full hit list — DRF pagination works | -| `get_all_result_ids()` | Extracts IDs from full hit list — unchanged | -| `count` in response | Correct — reflects all matching documents after ORM filtering | -| `next`/`previous` links | Correct — DRF computes from accurate count | -| Page N requests | Correct — DRF slices full list at `[(N-1)*size : N*size]` | +| Concern | Status | +| ----------------------------------------- | ------------------------------------------------------------------ | +| `TantivyRelevanceList.__len__()` | Returns `len(self._ordered_ids)` — ALL matching IDs, correct count | +| `TantivyRelevanceList.__getitem__(slice)` | Returns pre-fetched page_hits when aligned, stubs otherwise | +| `TantivyRelevanceList.__getitem__(int)` | Returns single SearchHit (from page_hits or stub) | +| `get_all_result_ids()` | Returns `ordered_ids` directly — no dict iteration | +| `count` in response | Correct — reflects all matching documents after ORM filtering | +| `next`/`previous` links | Correct — DRF computes from accurate count | +| Page N requests | Correct — DRF slices, gets pre-fetched page hits | ### Performance impact -| Operation | Before | After | -| ---------------------------------------- | ---------------------- | ----------------------------------------- | -| Snippet generations per search | Up to 10,000 | ~25 (page size) | -| Notes SnippetGenerator creations | Up to 10,000 (per hit) | ~25 (page size) | -| ORM sort query (Tantivy-sortable fields) | Always | Never (Tantivy sorts) | -| ORM sort query (custom fields) | Always | Still always (fallback) | -| `searcher.doc()` calls | Up to 10,000 | Up to 10,000 (unchanged — needed for IDs) | -| Tantivy searches per request | 1 | 1 | +| Operation | Before | After | +| ---------------------------------------- | ---------------------- | ------------------------------------------------- | +| Snippet generations per search | Up to 10,000 | ~25 (page size) via `highlight_hits()` | +| `searcher.doc()` calls for IDs | Up to 10,000 | All matches via `search_ids()` (ints, not dicts) | +| `searcher.doc()` calls for highlights | Up to 10,000 | ~25 via `highlight_hits()` (N individual lookups) | +| ORM sort query (Tantivy-sortable fields) | Always | Never (Tantivy sorts via `search_ids()`) | +| ORM sort query (custom fields) | Always | Still always (fallback) | +| Tantivy searches per request | 1 | 2 (`search_ids` + `highlight_hits`) | +| Hardcoded result cap | 10,000 | None (`searcher.num_docs`) | +| Memory per result (non-page hits) | ~100 bytes (SearchHit) | ~28 bytes (int) | + +### Known limitations + +- **`highlight_hits()` does N individual ID lookups**: tantivy-py does not expose a batch doc-address-by-ID API, so each page doc requires a separate `searcher.search(id_query, limit=1)`. Acceptable for page-sized batches (~25) but should not be called with thousands of IDs. +- **Text-based sort fields fall back to ORM**: `title`, `correspondent__name`, `document_type__name` produce different ordering in Tantivy (tokenized) vs ORM (collation), so they use the ORM sort path. ### What's NOT in this plan (future work) -- **Push ORM filters into Tantivy queries**: Would eliminate the ORM intersection (`filtered_qs.values_list`) and potentially reduce the 10k hit fetch. High effort, deferred. +- **Push ORM filters into Tantivy queries**: Would eliminate the ORM intersection (`filtered_qs.values_list`). High effort (~30 filter expressions to translate), deferred. Assessed as weeks of work. - **Tantivy fast-field ID extraction**: `searcher.doc()` loads the full stored document to get the ID. Tantivy's fast fields could provide IDs without loading stored docs. Depends on tantivy-py API support. -- **Adaptive overfetch limit**: The 10,000 limit is still fixed. Could be made smaller when ORM filters are absent, or adaptive based on historical filter rates. +- **Batch doc-address lookup**: Would eliminate the N individual lookups in `highlight_hits()`. Requires tantivy-py API changes or a workaround using term_set_query. diff --git a/docs/superpowers/plans/profiling-after-option1.txt b/docs/superpowers/plans/profiling-after-option1.txt new file mode 100644 index 000000000..033e94eac --- /dev/null +++ b/docs/superpowers/plans/profiling-after-option1.txt @@ -0,0 +1,78 @@ +Profiling data after Option 1: search_ids + page-only highlight_hits +===================================================================== +Run date: 2026-04-05 +Commit: 610ba2789 (feat: replace 10000 overfetch with search_ids + page-only highlights) +Test corpus: 200 documents + +============================================================ + Profile: relevance search (no ordering) +============================================================ + Wall time: 0.9167s + Queries: 33 (0.0000s) + Memory delta: 16477.8 KiB + Peak memory: 16504.5 KiB + +============================================================ + Profile: sorted search (ordering=created) +============================================================ + Wall time: 0.1378s + Queries: 32 (0.0000s) + Memory delta: 792.1 KiB + Peak memory: 818.9 KiB + +============================================================ + Profile: paginated search (page=2, page_size=25) +============================================================ + Wall time: 0.1322s + Queries: 32 (0.0000s) + Memory delta: 788.3 KiB + Peak memory: 815.2 KiB + +============================================================ + Profile: search with selection_data +============================================================ + Wall time: 0.1570s + Queries: 37 (0.0010s) + Memory delta: 837.3 KiB + Peak memory: 981.2 KiB + +============================================================ + Profile: backend.search(page_size=10000, all highlights) +============================================================ + Wall time: 0.0193s + Queries: 0 (0.0000s) + Memory delta: 88.6 KiB + Peak memory: 100.2 KiB + +============================================================ + Profile: backend.search(page_size=25) +============================================================ + Wall time: 0.0046s + Queries: 0 (0.0000s) + Memory delta: 5.9 KiB + Peak memory: 11.1 KiB + + +Comparison summary (200 docs): +============================== + +| Scenario | Baseline | After Option 1 | Change | +|---------------------------|-----------|----------------|--------------| +| Relevance — wall | 0.962s | 0.917s | -5% | +| Relevance — memory | 16557 KiB | 16478 KiB | -0.5% | +| Sorted — wall | 0.132s | 0.138s | ~same | +| Sorted — memory | 881 KiB | 792 KiB | -10% | +| Paginated — wall | 0.140s | 0.132s | -6% | +| Paginated — memory | 868 KiB | 788 KiB | -9% | +| Selection data — wall | 0.166s | 0.157s | -5% | +| Selection data — memory | 927 KiB | 837 KiB | -10% | +| Backend 10k — wall | 0.018s | 0.019s | same | +| Backend 10k — memory | 89 KiB | 89 KiB | same | +| Backend 25 — wall | 0.007s | 0.005s | -29% | +| Backend 25 — memory | 5.9 KiB | 5.9 KiB | same | + +Notes: +- The 10000 hardcoded limit has been removed; search_ids() now returns all matches. +- Relevance search is dominated by first-request import overhead (~16 MiB). +- Memory savings will scale with document count (ints vs SearchHit dicts). +- Backend-only 10k test is unchanged because it still calls search() directly.