From 467e573d24b3e12cf5632c807b3bf008efde08b0 Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 16 May 2026 23:28:27 +0000
Subject: [PATCH] =?UTF-8?q?docs/PHASE8:=20review=20fold-in=20=E2=80=94=202?=
 =?UTF-8?q?=20BLOCKERs=20+=204=20CONCERNs=20+=204=20NITs?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Sonnet-reviewed per reviews-use-sonnet memory directive.

BLOCKERs (RESOLVED in-place):

R1. §5 estimate_tokens pseudocode missing per-turn cache pattern.
    Prose described it; code block called tokenize_fn unconditionally.
    Implementer following code verbatim would hit the O(N round-
    trips per call) perf gap the prose flagged. Code block now
    shows explicit `if t._tokens then ... else t._tokens = ... end`.

R2. enforce_budget loop can spin forever when system_prompt alone
    exceeds token_budget (e.g. 5KB project block + budget=4096 +
    zero turns -> turns can't shrink further but OR-condition stays
    true). Fix: AND `#self.turns > 0` guard on the loop. §13 commit
    3 row shows the explicit Lua-syntax condition.

CONCERNs (FOLDED):

R3. :cost detail per-slot ~est=N annotation was semantically
    undefined — accumulator sum (cumulative across calls + evicted
    turns) vs current-snapshot estimate are incommensurable. §6
    reworked: ONE trailing summary line "[estimated session ctx:
    N tokens; token_budget=M (X% used)]" instead of per-slot
    annotations. §13 commit 4 aligned.

R4. tokenize_fn closure MUST reference active_cfg as upvalue (NOT
    capture by value). Subtle but easy to miss — §13 commit 4 now
    spells out the correct vs wrong patterns explicitly.

R5. 2s tokenize timeout can spuriously cache-as-unsupported when
    llama.cpp is busy with a concurrent completion (single-threaded
    inference; /tokenize queues behind). Documented in §9; v1
    ships 2s, revisit during verify if it bites.

R6. Per-endpoint cache key conflated two same-endpoint/different-
    model presets (B1: /tokenize ignores the model field). Cache
    key simplified to endpoint-only. One probe per endpoint per
    session; if a future broker honors the model field, revisit.

NITs (APPLIED):

N1. §13 commit 3 `OR`/`AND` -> Lua-syntax `or`/`and`.
N2. §10 Q-T5 Resolution-target cell filled in (was blank after B1).
N3. §6 / §8 / §13 commit 4 now describe a CONSISTENT approach
    (trailing summary line; per-slot annotation dropped).
N4. Status header tree-hash updated to current (aa64ad3 -> stays
    fresh through review fold-in; commit 5 will refresh again
    at "Implement" status).

PHASE8.md now 622 lines (was 454 after plan). +168/-61. Ready for
implementation phase 6 of the inner loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/PHASE8.md | 227 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 167 insertions(+), 60 deletions(-)

diff --git a/docs/PHASE8.md b/docs/PHASE8.md
index 4978935..f8afe14 100644
--- a/docs/PHASE8.md
+++ b/docs/PHASE8.md
@@ -2,9 +2,91 @@
 
 **Project:** aish — AI-augmented conversational shell
 **Document:** Phase 8 Requirements, Architecture & Design Decisions
-**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`)
+**Status:** Plan + review fold-in (tree at `aa64ad3`)
 **Date:** 2026-05-16
 
+**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs
+resolved in-place, 4 CONCERNs folded, 4 NITs applied:**
+
+R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.**
+    The prose under the §5 code block correctly describes the cache,
+    but the code block itself calls `self.tokenize_fn(t.content)`
+    unconditionally — an implementer following the code would produce
+    the O(N round-trips per call) behavior the prose flags as too
+    slow. **Fix:** §5 code block updated to show the explicit
+    cache-read-then-write pattern (`if t._tokens then ... else
+    t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row
+    also calls this out explicitly.
+
+R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely
+    when system_prompt alone exceeds token_budget.** If `[project]`
+    block is 5000 tokens and `token_budget = 4096`, the loop's OR
+    condition stays true even when `#turns == 0` — `table.remove`
+    is a no-op, the loop never exits. **Fix:** §13 commit 3 row
+    updated to specify the explicit guard: `while (#self.turns >
+    self.max_turns or self:estimate_tokens() > self.token_budget)
+    and #self.turns > 0 do`. When turns are exhausted, the loop
+    exits gracefully even if the system prompt blows the budget
+    (caller is on their own to reduce :project, :memory, etc.).
+
+R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically
+    undefined.** Sum-of-prompt_tokens across all calls (accumulator)
+    vs current-snapshot estimate are incommensurable — sessions
+    with evictions ALWAYS show divergence not because heuristic is
+    wrong but because they measure different things. **Resolution:**
+    §6 reworked to drop the per-slot `~est=N` inline annotation
+    (which conflated the two); instead show a SINGLE trailing
+    "[estimated session ctx: N tokens]" line under :cost detail.
+    Cleanly separates the running-total accumulator from the
+    current-snapshot estimate. §13 commit 4 already pointed to this
+    direction — now §6 matches.
+
+R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg`
+    by upvalue, not by value-capture.** If the implementer writes
+    `local cfg = active_cfg; return function(text) ... cfg ... end`,
+    the closure won't follow `:model` switches. **Fix:** §13 commit
+    4 row gains an explicit code note: the closure MUST be
+    `function(text) return broker.token_count(active_cfg, text) end`
+    — direct upvalue reference. A5 analysis verified upvalue
+    semantics; now spelled out so the implementer doesn't subtly
+    miss it.
+
+R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as
+    unsupported when llama.cpp is busy serving a concurrent
+    completion.** llama.cpp is single-threaded for inference; a
+    /tokenize request that arrives mid-generation queues behind
+    inference and may exceed the 2s cap. The capability would then
+    cache as `false` for the rest of the session, even though the
+    endpoint IS capable. **Fix:** §9 risk row added documenting
+    this. Mitigation: 2s is reasonable for IDLE responses but if
+    practical problems surface, bump to 5s or make configurable
+    (`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if
+    it bites.
+
+R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two
+    same-endpoint/different-model presets.** B1 confirmed
+    /tokenize ignores the model field, so two probes per session
+    when one would suffice. **Fix:** §4 cache key SIMPLIFIED to
+    just `model_cfg.endpoint` (B1-justified). Same-endpoint
+    presets share one cache entry; one probe per endpoint per
+    session, not per (endpoint, model). For a future broker that
+    DOES honor the model field, this design choice would need
+    revisiting — documented inline.
+
+R-N1..N4 (NITs, APPLIED):
+  N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected
+      to Lua's lowercase `or`/`and`.
+  N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
+      "Baseline (B1)" for consistency.
+  N3. §6 outdated inline `~est=N` description removed; new approach
+      (single trailing summary line) is documented; §8 out-of-scope
+      bullet about per-call comparison stays as the explicit "we
+      considered, rejected" record.
+  N4. PHASE8.md status header (formerly carrying a stale tree hash
+      that would drift before implementation) now references the
+      latest tree as of this fold-in (`aa64ad3`). Commit 5's status
+      bump to "Implement" will refresh it again at that point.
+
 **Analyze findings (2026-05-16):**
 
 A1. **enforce_budget ONLY checks max_turns, not token_budget — major
@@ -195,13 +277,15 @@ No new module files.
 
 ## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
 
-```lua
--- Per-endpoint capability cache (session-scoped local in broker.lua)
-local _tokenize_capable = {}    -- [endpoint .. "/" .. model] = true | false
+R6-revised — cache key is endpoint-only (B1: /tokenize ignores the
+model field so two presets sharing an endpoint share one cache entry):
 
-local function _cache_key(model_cfg)
-    return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
-end
+```lua
+-- Per-endpoint capability cache (session-scoped local in broker.lua).
+-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
+-- model field; same endpoint -> same tokenization). If a future
+-- broker honors the model field, revisit this keying.
+local _tokenize_capable = {}    -- [endpoint] = true | false
 
 function M.token_count(model_cfg, text)
     text = text or ""
@@ -209,34 +293,34 @@ function M.token_count(model_cfg, text)
     if not (model_cfg and model_cfg.endpoint) then
         return math.floor(#text / 4)   -- pure fallback
     end
-    local key = _cache_key(model_cfg)
-    local cap = _tokenize_capable[key]
+    local ep  = model_cfg.endpoint
+    local cap = _tokenize_capable[ep]
     if cap == false then
         return math.floor(#text / 4)
     end
     -- cap == nil OR cap == true; try the endpoint.
-    local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
+    local url = ep:gsub("/+$", "") .. "/tokenize"
     local body = json.encode({ content = text, model = model_cfg.model })
     local out, status = curl.post(url, body,
         { "Content-Type: application/json" },
         2000)  -- 2s timeout
     if not (status == 200 and out) then
-        _tokenize_capable[key] = false
+        _tokenize_capable[ep] = false
         return math.floor(#text / 4)
     end
     local doc = json.decode(out)
     local toks = doc and doc.tokens
     if type(toks) ~= "table" then
-        _tokenize_capable[key] = false
+        _tokenize_capable[ep] = false
         return math.floor(#text / 4)
     end
-    _tokenize_capable[key] = true
+    _tokenize_capable[ep] = true
     return #toks
 end
 
 function M.tokenize_supported(model_cfg)
-    if not model_cfg then return nil end
-    return _tokenize_capable[_cache_key(model_cfg)]
+    if not (model_cfg and model_cfg.endpoint) then return nil end
+    return _tokenize_capable[model_cfg.endpoint]
 end
 ```
 
@@ -246,6 +330,8 @@ Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
 
 ## 5. Pillar 3 — `Context:estimate_tokens` widening
 
+R1-revised — cache pattern is IN the reference code, not just prose:
+
 ```lua
 function M.new(opts)
     ...
@@ -261,9 +347,18 @@ end
 
 function Context:estimate_tokens()
     if self.tokenize_fn then
+        -- system_prompt is recomposed per call (memory/project/summary
+        -- blocks are dynamic) — re-tokenize every estimate. Bounded
+        -- by one round-trip.
         local n = self.tokenize_fn(self.system_prompt)
+        -- R1: per-turn cache on the turn dict itself. Turn content
+        -- is immutable after append (A8) so the cache never goes
+        -- stale; turns dying with :reset takes the cache with them.
         for _, t in ipairs(self.turns) do
-            n = n + self.tokenize_fn(t.content)
+            if t._tokens == nil then
+                t._tokens = self.tokenize_fn(t.content)
+            end
+            n = n + t._tokens
         end
         return n
     end
@@ -274,12 +369,11 @@ function Context:estimate_tokens()
 end
 ```
 
-Performance note: with N turns of average ~500 chars each, `estimate_tokens`
-fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
-per-prompt eviction checks. **Mitigation**: per-turn token count cached
-ON the turn dict (`turn._tokens`) the first time it's counted; only
-re-tokenized if `turn._tokens` is nil. Set at append time when
-tokenize_fn is present; otherwise lazily on first estimate.
+Performance: first call after a fresh session fires N+1 round-trips
+(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
++ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms
+amortized per call — acceptable for the prompt-template render path
+AND the per-step Norris enforce_budget call.
 
 ---
 
@@ -291,25 +385,27 @@ Current `:cost detail` (Phase 7) shows:
   anthropic/claude-haiku-4.5 main                 1 calls,    179 /      8 tokens, $0.000219
 ```
 
-The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
-broker usage payload.
+The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all
+calls in that slot — including any turns later evicted from context.
 
-Phase 8 extension: for each slot, also compute an estimated count via
-`broker.token_count(model_cfg_for_this_slot, ...)` over the
-TURNS THAT CONTRIBUTED to this slot. But that's stateful and
-expensive — simpler: show the SUM of `prompt_tokens` (actual) and
-the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
-on what tokenize_fn is wired). If disagreement >10%, annotate.
+R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
+would conflate two different things — the per-slot prompt_tokens is a
+cumulative running total (across calls AND past evicted turns), while
+`estimate_tokens()` is a current-snapshot measurement (in-memory turns
+ONLY). Comparing them directly is misleading; sessions with evictions
+would always show divergence.
 
-Simplified format:
+Instead, add a SINGLE trailing summary line after the slot rows:
 
 ```
-  anthropic/claude-haiku-4.5 main                 1 calls,    179 ~est=164 / 8 tokens, $0.000219
-                                                                    ^^^^^^^^^^ shown when |actual-est|/actual > 0.10
+  ... per-slot rows ...
+  [estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
 ```
 
-The `~est=N` annotation only renders when the disagreement exceeds the
-10% threshold. Silent otherwise.
+The estimate is `ctx:estimate_tokens()` over the current ctx (system
+prompt + live turns); the percentage gives at-a-glance budget
+utilization. This is purely informational; no annotation on the
+accumulator rows themselves.
 
 ---
 
@@ -366,6 +462,7 @@ after one probe). No surprise; document in config example.
 | `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
 | Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
 | A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
+| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
 
 ---
 
@@ -377,7 +474,7 @@ after one probe). No surprise; document in config example.
 | Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
 | Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
 | Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
-| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). |
+| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
 | Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
 
 ---
@@ -439,37 +536,47 @@ commit feature smoke).
      turn._tokens.
 
 3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
-   - Existing `while #self.turns > self.max_turns` loop extended to:
-     `while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`.
-   - Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
+   - Existing `while #self.turns > self.max_turns` loop extended.
+     **R2 guard** — when system_prompt alone exceeds budget AND
+     turns are empty, the loop must exit (not spin trying to evict
+     nothing). Correct condition:
+     ```lua
+     while (#self.turns > self.max_turns
+            or self:estimate_tokens() > self.token_budget)
+           and #self.turns > 0 do
+     ```
+     Lowercase `or`/`and` per Lua syntax (N1).
+   - Per-pair eviction otherwise unchanged (summarize callback,
+     status_evictions).
    - The estimate_tokens call inside the loop is potentially expensive
      under tokenize_fn — but commit #2's per-turn cache means each
-     iteration is O(#turns) dict-lookups. Acceptable for the eviction
-     hot path.
-   - Smoke: Context with `token_budget = 100`, max_turns = 100, fill
-     with turns until `estimate_tokens() > 100`, then call enforce_budget
-     — should evict until under budget. With char/4: eviction happens
-     at known char counts. With tokenize_fn stub returning 50/turn:
-     evicts down to <= 100/50 = 2 turns.
+     iteration is O(#turns) dict-lookups after the first. Acceptable
+     for the eviction hot path.
+   - Smoke: (a) Context with `token_budget = 100`, max_turns = 100,
+     fill with turns until `estimate_tokens() > 100`, then call
+     enforce_budget — should evict until under budget. (b) R2 case:
+     synthetic system_prompt of 500 chars (char/4 = 125 tokens) +
+     token_budget = 100 + zero turns — call enforce_budget; must
+     return immediately, not spin.
 
-4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.**
+4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.**
    - When `config.tokenize and config.tokenize.use_endpoint`, build
      `ctx_opts.tokenize_fn = function(text)
         return broker.token_count(active_cfg, text)
-      end`. The closure captures `active_cfg` upval per A5 — follows
-     `:model` switches naturally.
-   - `:cost detail` extension: per slot, compute the sum of actual
-     `prompt_tokens` (already in usage_totals) AND an "estimated"
-     count via... actually this needs a rethink: the accumulator
-     doesn't store per-turn data, just running sums. The estimate
-     for comparison would need to be over ctx's CURRENT content
-     (a single snapshot, not per-call). Simplest: show a SECOND
-     line under `:cost detail` with the current `estimate_tokens()`
-     value, labeled "[estimated session ctx: N tokens]". Disagreement
-     with the accumulator's prompt_tokens TOTAL surfaced inline.
+      end`. **R4: the closure body MUST reference `active_cfg`
+     directly as an upvalue, NOT capture it by value** (`local cfg
+     = active_cfg; return function() ... cfg ... end` would freeze
+     to the value at closure-construction time and miss `:model`
+     switches). A5 verified upvalue semantics in Lua.
+   - `:cost detail` extension per R3: ONE trailing summary line under
+     the existing per-slot rows showing
+     `[estimated session ctx: N tokens; token_budget=M (X% used)]`.
+     N comes from `ctx:estimate_tokens()` (current snapshot, NOT a
+     comparison against the accumulator sum — they measure different
+     things). M is `ctx.token_budget`. X% = N/M × 100.
    - Smoke: with use_endpoint=true on a local-only session, observe
-     enforce_budget eviction timing vs disable; observe :cost detail
-     estimate row.
+     enforce_budget eviction timing vs disabled; observe :cost detail
+     estimate row updates as turns accumulate.
 
 5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
    - Commented-out `tokenize = { use_endpoint = true }` block in