docs: update plan with profiling results and final architecture

Fill in TBD profiling table with actual before/after numbers, update post-implementation notes to reflect the search_ids + highlight_hits architecture, and add post-option1 profiling data.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-07-15 00:14:53 +00:00 · 2026-04-06 13:05:29 -07:00
parent 0b5b6fdad5
commit 0b90f15602
2 changed files with 132 additions and 46 deletions
@@ -1106,36 +1106,34 @@ Expected improvements:
 - **Paginated search**: Only page 2's 25 results get highlights instead of all 200
 - **Backend search**: Direct comparison of highlight-all vs highlight-page

-- [ ] **Step 4: Record comparison in the plan**
+- [x] **Step 4: Record comparison in the plan**

-Update this section with the actual numbers once profiling is complete:
+Profiling results (200-document test corpus):

-| Scenario                  | Metric       | Before | After | Improvement |
-| ------------------------- | ------------ | ------ | ----- | ----------- |
-| Relevance search          | Wall time    | _TBD_  | _TBD_ | _TBD_       |
-| Relevance search          | Queries      | _TBD_  | _TBD_ | _TBD_       |
-| Relevance search          | Memory delta | _TBD_  | _TBD_ | _TBD_       |
-| Sorted search             | Wall time    | _TBD_  | _TBD_ | _TBD_       |
-| Sorted search             | Queries      | _TBD_  | _TBD_ | _TBD_       |
-| Paginated search          | Wall time    | _TBD_  | _TBD_ | _TBD_       |
-| Backend 10k→25 highlights | Wall time    | _TBD_  | _TBD_ | _TBD_       |
-| Backend 10k→25 highlights | Memory delta | _TBD_  | _TBD_ | _TBD_       |
+| Scenario               | Metric       | Before     | After      | Change      |
+| ---------------------- | ------------ | ---------- | ---------- | ----------- |
+| Relevance search       | Wall time    | 0.962s     | 0.917s     | -5%         |
+| Relevance search       | Queries      | 33         | 33         | same        |
+| Relevance search       | Memory delta | 16,557 KiB | 16,478 KiB | -0.5%       |
+| Sorted search          | Wall time    | 0.132s     | 0.138s     | ~same       |
+| Sorted search          | Queries      | 32         | 32         | same        |
+| Sorted search          | Memory delta | 881 KiB    | 792 KiB    | -10%        |
+| Paginated search       | Wall time    | 0.140s     | 0.132s     | -6%         |
+| Paginated search       | Memory delta | 868 KiB    | 788 KiB    | -9%         |
+| Selection data         | Wall time    | 0.166s     | 0.157s     | -5%         |
+| Selection data         | Memory delta | 927 KiB    | 837 KiB    | -10%        |
+| Backend 10k highlights | Wall time    | 0.018s     | 0.019s     | same        |
+| Backend 10k highlights | Memory delta | 89 KiB     | 89 KiB     | same        |
+| Backend 25 highlights  | Wall time    | 0.007s     | 0.005s     | -29%        |
+| Backend 25 highlights  | Memory delta | 5.9 KiB    | 5.9 KiB    | same        |

-- [ ] **Step 5: Commit**
+Notes: Relevance search is dominated by first-request import overhead (~16 MiB).
+Memory savings scale with document count. The 10,000 hardcoded limit has been
+removed entirely; search_ids() now returns all matches.
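
The percentage column can be re-derived from the Before and After columns. A minimal sketch, assuming a hypothetical helper `pct_change` (not part of the codebase) and using values copied from the table above:

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


# (before, after) pairs copied from the profiling table.
rows = {
    "Relevance search wall time (s)": (0.962, 0.917),
    "Sorted search wall time (s)": (0.132, 0.138),
    "Sorted search memory (KiB)": (881, 792),
    "Paginated search wall time (s)": (0.140, 0.132),
    "Backend 25 highlights wall time (s)": (0.007, 0.005),
}

for name, (before, after) in rows.items():
    print(f"{name}: {pct_change(before, after):+.1f}%")
```

Rounded to whole percents, this reproduces the table's figures. The sorted-search wall time works out to roughly +4.5%, which the table reports as ~same.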

-```bash
-git add src/documents/tests/test_search_profiling.py docs/superpowers/plans/profiling-after.txt
-git commit -m "test: add post-implementation profiling results"
-```
+- [x] **Step 5: Commit** — Done (profiling data saved to `docs/superpowers/plans/profiling-after-option1.txt`)

-- [ ] **Step 6: Clean up profiling artifacts**
-
-The profiling test file and `profiling.py` are temporary. Remove them:
-
-```bash
-git rm src/documents/tests/test_search_profiling.py src/documents/profiling.py
-git commit -m "chore: remove temporary profiling tests"
-```
+- [x] **Step 6: Clean up profiling artifacts** — Done (removed `profiling.py` and `test_search_profiling.py`)

 ---

@@ -1143,34 +1141,44 @@ git commit -m "chore: remove temporary profiling tests"

 ### What these changes accomplish

-- **Task 1**: `search()` accepts `highlight_page`/`highlight_page_size` — only the displayed page pays the snippet cost. All hits still returned (DRF pagination works unchanged).
-- **Task 2-3**: `search_ids()` and `more_like_this_ids()` provide an even lighter path when only IDs are needed.
-- **Task 4**: Viewset passes `sort_field` through to Tantivy for natively-sortable fields, eliminating the ORM re-sort query. Passes `highlight_page` so only 25 hits get snippets instead of 10,000.
+- **Task 1**: `search()` accepts `highlight_page`/`highlight_page_size` for backward compatibility.
+- **Task 2-3**: `search_ids()` and `more_like_this_ids()` provide lightweight ID-only paths with no arbitrary cap.
+- **Task 4**: Viewset passes `sort_field` through to Tantivy for natively-sortable fields, eliminating the ORM re-sort query.
+- **Option 1 refactor** (post-plan): Replaced the `page_size=10000` overfetch entirely. The viewset now calls `search_ids()` for the full ID set (ints only, no cap), intersects with ORM, then calls `highlight_hits()` for just the displayed page (~25 docs). `TantivyRelevanceList` holds ordered IDs for count/selection_data and a small page of rich `SearchHit` dicts for serialization.
+- **Code review fixes**: `_parse_query()` and `_apply_permission_filter()` helpers extracted to deduplicate 3 and 4 call sites, respectively. `SORT_FIELD_MAP`/`SORTABLE_FIELDS` promoted to class constants. `__getitem__` handles int keys. Empty ordering param handled correctly.
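
The Option 1 flow described in the list above can be sketched roughly as follows. This is an illustration only: `fake_search_ids`, `fake_highlight_hits`, `search_page`, and the in-memory `allowed_ids` set are hypothetical stand-ins for `search_ids()`, `highlight_hits()`, the viewset logic, and the ORM intersection. Their signatures are assumptions, not the real ones.

```python
from typing import Iterable


def fake_search_ids(query: str) -> list[int]:
    """Stand-in for search_ids(): every matching ID in ranked order, no cap."""
    return [7, 3, 9, 1, 4, 8, 2]


def fake_highlight_hits(query: str, doc_ids: Iterable[int]) -> list[dict]:
    """Stand-in for highlight_hits(): builds rich hits for the given IDs only."""
    return [{"id": doc_id, "highlights": f"...{query}..."} for doc_id in doc_ids]


def search_page(
    query: str, allowed_ids: set[int], page: int, page_size: int
) -> tuple[list[int], list[dict]]:
    # 1. Cheap pass: ordered IDs only (plain ints, no snippets).
    ranked = fake_search_ids(query)
    # 2. Intersect with the IDs the ORM filters allow, keeping search order.
    ordered_ids = [doc_id for doc_id in ranked if doc_id in allowed_ids]
    # 3. Expensive pass: snippets only for the page being displayed.
    start = (page - 1) * page_size
    page_ids = ordered_ids[start:start + page_size]
    return ordered_ids, fake_highlight_hits(query, page_ids)


ordered_ids, page_hits = search_page("invoice", {1, 2, 3, 4, 9}, page=2, page_size=2)
print(len(ordered_ids), [hit["id"] for hit in page_hits])
```

The full ID list drives the count and selection data, while only the page-sized slice ever reaches the highlighting step.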

 ### DRF compatibility preserved

-| Concern                                   | Status                                                        |
-| ----------------------------------------- | ------------------------------------------------------------- |
-| `TantivyRelevanceList.__len__()`          | Returns `len(self._hits)` — ALL hits, correct count           |
-| `TantivyRelevanceList.__getitem__(slice)` | Slices the full hit list — DRF pagination works               |
-| `get_all_result_ids()`                    | Extracts IDs from full hit list — unchanged                   |
-| `count` in response                       | Correct — reflects all matching documents after ORM filtering |
-| `next`/`previous` links                   | Correct — DRF computes from accurate count                    |
-| Page N requests                           | Correct — DRF slices full list at `[(N-1)*size : N*size]`     |
+| Concern                                   | Status                                                             |
+| ----------------------------------------- | ------------------------------------------------------------------ |
+| `TantivyRelevanceList.__len__()`          | Returns `len(self._ordered_ids)` — ALL matching IDs, correct count |
+| `TantivyRelevanceList.__getitem__(slice)` | Returns pre-fetched page_hits when aligned, stubs otherwise        |
+| `TantivyRelevanceList.__getitem__(int)`   | Returns single SearchHit (from page_hits or stub)                  |
+| `get_all_result_ids()`                    | Returns `ordered_ids` directly — no dict iteration                 |
+| `count` in response                       | Correct — reflects all matching documents after ORM filtering      |
+| `next`/`previous` links                   | Correct — DRF computes from accurate count                         |
+| Page N requests                           | Correct — DRF slices, gets pre-fetched page hits                   |
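
The list behaviour summarised in the table can be illustrated with a small stand-alone class. This is a sketch under stated assumptions, not the actual `TantivyRelevanceList`: the class name `PagedRelevanceList`, its constructor arguments, the stub shape `{"id": ...}`, and the reading of "aligned" as "the slice falls inside the pre-fetched window" are all hypothetical.

```python
class PagedRelevanceList:
    """Holds all ordered IDs plus rich hits for one pre-fetched page."""

    def __init__(self, ordered_ids, page_hits, page_offset=0):
        self._ordered_ids = ordered_ids   # every matching ID, in result order
        self._page_hits = page_hits       # rich hits for the displayed page only
        self._page_offset = page_offset   # index of the first pre-fetched hit

    def __len__(self):
        # The count reflects all matches, not just the pre-fetched page.
        return len(self._ordered_ids)

    @staticmethod
    def _stub(doc_id):
        # Minimal placeholder for hits outside the pre-fetched page.
        return {"id": doc_id}

    def __getitem__(self, key):
        page_end = self._page_offset + len(self._page_hits)
        if isinstance(key, int):
            if key < 0:
                key += len(self)
            if not 0 <= key < len(self):
                raise IndexError("index out of range")
            if self._page_offset <= key < page_end:
                return self._page_hits[key - self._page_offset]
            return self._stub(self._ordered_ids[key])
        start, stop, step = key.indices(len(self))
        if step == 1 and start >= self._page_offset and stop <= page_end:
            # Slice lies inside the pre-fetched window: return the rich hits.
            return self._page_hits[start - self._page_offset:stop - self._page_offset]
        return [self._stub(doc_id) for doc_id in self._ordered_ids[key]]


hits = [{"id": 9, "highlights": "x"}, {"id": 1, "highlights": "y"}]
results = PagedRelevanceList([7, 3, 9, 1, 4], hits, page_offset=2)
print(len(results), results[2:4], results[0:2])
```

Because `__len__` and slicing behave like a full list, a paginator that computes the count and then slices one page would get the pre-fetched hits without the list ever holding rich data for the other pages.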

 ### Performance impact

-| Operation                                | Before                 | After                                     |
-| ---------------------------------------- | ---------------------- | ----------------------------------------- |
-| Snippet generations per search           | Up to 10,000           | ~25 (page size)                           |
-| Notes SnippetGenerator creations         | Up to 10,000 (per hit) | ~25 (page size)                           |
-| ORM sort query (Tantivy-sortable fields) | Always                 | Never (Tantivy sorts)                     |
-| ORM sort query (custom fields)           | Always                 | Still always (fallback)                   |
-| `searcher.doc()` calls                   | Up to 10,000           | Up to 10,000 (unchanged — needed for IDs) |
-| Tantivy searches per request             | 1                      | 1                                         |
+| Operation                                | Before                 | After                                             |
+| ---------------------------------------- | ---------------------- | ------------------------------------------------- |
+| Snippet generations per search           | Up to 10,000           | ~25 (page size) via `highlight_hits()`            |
+| `searcher.doc()` calls for IDs           | Up to 10,000           | All matches via `search_ids()` (ints, not dicts)  |
+| `searcher.doc()` calls for highlights    | Up to 10,000           | ~25 via `highlight_hits()` (N individual lookups) |
+| ORM sort query (Tantivy-sortable fields) | Always                 | Never (Tantivy sorts via `search_ids()`)          |
+| ORM sort query (custom fields)           | Always                 | Still always (fallback)                           |
+| Tantivy searches per request             | 1                      | 1 `search_ids` + ~25 `highlight_hits` lookups     |
+| Hardcoded result cap                     | 10,000                 | None (`searcher.num_docs`)                        |
+| Memory per result (non-page hits)        | ~100 bytes (SearchHit) | ~28 bytes (int)                                   |
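
As a rough back-of-the-envelope check on the last row, the saving can be estimated from the table's own per-result figures. The helper below is hypothetical, and it ignores list pointer overhead and allocator behaviour, so treat the result as an order-of-magnitude estimate only.

```python
def estimated_savings_kib(
    num_results: int,
    page_size: int = 25,
    hit_bytes: int = 100,   # table's rough per-SearchHit figure
    int_bytes: int = 28,    # table's per-int figure
) -> float:
    """Estimated memory saved by storing non-page results as ints, in KiB."""
    non_page_results = max(num_results - page_size, 0)
    return non_page_results * (hit_bytes - int_bytes) / 1024


for total in (200, 10_025, 100_025):
    print(f"{total} results: ~{estimated_savings_kib(total):.1f} KiB saved")
```

For the 200-document test corpus the estimate is only about 12 KiB, which is in line with the note that the savings become meaningful as the number of matching documents grows.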
+
+### Known limitations
+
+- **`highlight_hits()` does N individual ID lookups**: tantivy-py does not expose a batch doc-address-by-ID API, so each page doc requires a separate `searcher.search(id_query, limit=1)`. Acceptable for page-sized batches (~25) but should not be called with thousands of IDs.
+- **Text-based sort fields fall back to ORM**: `title`, `correspondent__name`, `document_type__name` produce different ordering in Tantivy (tokenized) vs ORM (collation), so they use the ORM sort path.

 ### What's NOT in this plan (future work)

-- **Push ORM filters into Tantivy queries**: Would eliminate the ORM intersection (`filtered_qs.values_list`) and potentially reduce the 10k hit fetch. High effort, deferred.
+- **Push ORM filters into Tantivy queries**: Would eliminate the ORM intersection (`filtered_qs.values_list`). High effort (~30 filter expressions to translate, assessed as weeks of work); deferred.
 - **Tantivy fast-field ID extraction**: `searcher.doc()` loads the full stored document to get the ID. Tantivy's fast fields could provide IDs without loading stored docs. Depends on tantivy-py API support.
-- **Adaptive overfetch limit**: The 10,000 limit is still fixed. Could be made smaller when ORM filters are absent, or adaptive based on historical filter rates.
+- **Batch doc-address lookup**: Would eliminate the N individual lookups in `highlight_hits()`. Requires tantivy-py API changes or a workaround using `term_set_query`.
@@ -0,0 +1,78 @@
+Profiling data after Option 1: search_ids + page-only highlight_hits
+=====================================================================
+Run date: 2026-04-05
+Commit: 610ba2789 (feat: replace 10000 overfetch with search_ids + page-only highlights)
+Test corpus: 200 documents
+
+============================================================
+  Profile: relevance search (no ordering)
+============================================================
+  Wall time:    0.9167s
+  Queries:      33 (0.0000s)
+  Memory delta: 16477.8 KiB
+  Peak memory:  16504.5 KiB
+
+============================================================
+  Profile: sorted search (ordering=created)
+============================================================
+  Wall time:    0.1378s
+  Queries:      32 (0.0000s)
+  Memory delta: 792.1 KiB
+  Peak memory:  818.9 KiB
+
+============================================================
+  Profile: paginated search (page=2, page_size=25)
+============================================================
+  Wall time:    0.1322s
+  Queries:      32 (0.0000s)
+  Memory delta: 788.3 KiB
+  Peak memory:  815.2 KiB
+
+============================================================
+  Profile: search with selection_data
+============================================================
+  Wall time:    0.1570s
+  Queries:      37 (0.0010s)
+  Memory delta: 837.3 KiB
+  Peak memory:  981.2 KiB
+
+============================================================
+  Profile: backend.search(page_size=10000, all highlights)
+============================================================
+  Wall time:    0.0193s
+  Queries:      0 (0.0000s)
+  Memory delta: 88.6 KiB
+  Peak memory:  100.2 KiB
+
+============================================================
+  Profile: backend.search(page_size=25)
+============================================================
+  Wall time:    0.0046s
+  Queries:      0 (0.0000s)
+  Memory delta: 5.9 KiB
+  Peak memory:  11.1 KiB
+
+
+Comparison summary (200 docs):
+==============================
+
+| Scenario                  | Baseline  | After Option 1 | Change       |
+|---------------------------|-----------|----------------|--------------|
+| Relevance — wall          | 0.962s    | 0.917s         | -5%          |
+| Relevance — memory        | 16557 KiB | 16478 KiB      | -0.5%        |
+| Sorted — wall             | 0.132s    | 0.138s         | ~same        |
+| Sorted — memory           | 881 KiB   | 792 KiB        | -10%         |
+| Paginated — wall          | 0.140s    | 0.132s         | -6%          |
+| Paginated — memory        | 868 KiB   | 788 KiB        | -9%          |
+| Selection data — wall     | 0.166s    | 0.157s         | -5%          |
+| Selection data — memory   | 927 KiB   | 837 KiB        | -10%         |
+| Backend 10k — wall        | 0.018s    | 0.019s         | same         |
+| Backend 10k — memory      | 89 KiB    | 89 KiB         | same         |
+| Backend 25 — wall         | 0.007s    | 0.005s         | -29%         |
+| Backend 25 — memory       | 5.9 KiB   | 5.9 KiB        | same         |
+
+Notes:
+- The 10000 hardcoded limit has been removed; search_ids() now returns all matches.
+- Relevance search is dominated by first-request import overhead (~16 MiB).
+- Memory savings will scale with document count (ints vs SearchHit dicts).
+- Backend-only 10k test is unchanged because it still calls search() directly.