From 467e573d24b3e12cf5632c807b3bf008efde08b0 Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Sat, 16 May 2026 23:28:27 +0000 Subject: [PATCH] =?UTF-8?q?docs/PHASE8:=20review=20fold-in=20=E2=80=94=202?= =?UTF-8?q?=20BLOCKERs=20+=204=20CONCERNs=20+=204=20NITs?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Sonnet-reviewed per reviews-use-sonnet memory directive. BLOCKERs (RESOLVED in-place): R1. §5 estimate_tokens pseudocode missing per-turn cache pattern. Prose described it; code block called tokenize_fn unconditionally. Implementer following code verbatim would hit the O(N round- trips per call) perf gap the prose flagged. Code block now shows explicit `if t._tokens then ... else t._tokens = ... end`. R2. enforce_budget loop can spin forever when system_prompt alone exceeds token_budget (e.g. 5KB project block + budget=4096 + zero turns -> turns can't shrink further but OR-condition stays true). Fix: AND `#self.turns > 0` guard on the loop. §13 commit 3 row shows the explicit Lua-syntax condition. CONCERNs (FOLDED): R3. :cost detail per-slot ~est=N annotation was semantically undefined — accumulator sum (cumulative across calls + evicted turns) vs current-snapshot estimate are incommensurable. §6 reworked: ONE trailing summary line "[estimated session ctx: N tokens; token_budget=M (X% used)]" instead of per-slot annotations. §13 commit 4 aligned. R4. tokenize_fn closure MUST reference active_cfg as upvalue (NOT capture by value). Subtle but easy to miss — §13 commit 4 now spells out the correct vs wrong patterns explicitly. R5. 2s tokenize timeout can spuriously cache-as-unsupported when llama.cpp is busy with a concurrent completion (single-threaded inference; /tokenize queues behind). Documented in §9; v1 ships 2s, revisit during verify if it bites. R6. Per-endpoint cache key conflated two same-endpoint/different- model presets (B1: /tokenize ignores the model field). Cache key simplified to endpoint-only. One probe per endpoint per session; if a future broker honors the model field, revisit. NITs (APPLIED): N1. §13 commit 3 `OR`/`AND` -> Lua-syntax `or`/`and`. N2. §10 Q-T5 Resolution-target cell filled in (was blank after B1). N3. §6 / §8 / §13 commit 4 now describe a CONSISTENT approach (trailing summary line; per-slot annotation dropped). N4. Status header tree-hash updated to current (aa64ad3 -> stays fresh through review fold-in; commit 5 will refresh again at "Implement" status). PHASE8.md now 622 lines (was 454 after plan). +168/-61. Ready for implementation phase 6 of the inner loop. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/PHASE8.md | 227 ++++++++++++++++++++++++++++++++++++------------- 1 file changed, 167 insertions(+), 60 deletions(-) diff --git a/docs/PHASE8.md b/docs/PHASE8.md index 4978935..f8afe14 100644 --- a/docs/PHASE8.md +++ b/docs/PHASE8.md @@ -2,9 +2,91 @@ **Project:** aish — AI-augmented conversational shell **Document:** Phase 8 Requirements, Architecture & Design Decisions -**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`) +**Status:** Plan + review fold-in (tree at `aa64ad3`) **Date:** 2026-05-16 +**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs +resolved in-place, 4 CONCERNs folded, 4 NITs applied:** + +R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.** + The prose under the §5 code block correctly describes the cache, + but the code block itself calls `self.tokenize_fn(t.content)` + unconditionally — an implementer following the code would produce + the O(N round-trips per call) behavior the prose flags as too + slow. **Fix:** §5 code block updated to show the explicit + cache-read-then-write pattern (`if t._tokens then ... else + t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row + also calls this out explicitly. + +R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely + when system_prompt alone exceeds token_budget.** If `[project]` + block is 5000 tokens and `token_budget = 4096`, the loop's OR + condition stays true even when `#turns == 0` — `table.remove` + is a no-op, the loop never exits. **Fix:** §13 commit 3 row + updated to specify the explicit guard: `while (#self.turns > + self.max_turns or self:estimate_tokens() > self.token_budget) + and #self.turns > 0 do`. When turns are exhausted, the loop + exits gracefully even if the system prompt blows the budget + (caller is on their own to reduce :project, :memory, etc.). + +R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically + undefined.** Sum-of-prompt_tokens across all calls (accumulator) + vs current-snapshot estimate are incommensurable — sessions + with evictions ALWAYS show divergence not because heuristic is + wrong but because they measure different things. **Resolution:** + §6 reworked to drop the per-slot `~est=N` inline annotation + (which conflated the two); instead show a SINGLE trailing + "[estimated session ctx: N tokens]" line under :cost detail. + Cleanly separates the running-total accumulator from the + current-snapshot estimate. §13 commit 4 already pointed to this + direction — now §6 matches. + +R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg` + by upvalue, not by value-capture.** If the implementer writes + `local cfg = active_cfg; return function(text) ... cfg ... end`, + the closure won't follow `:model` switches. **Fix:** §13 commit + 4 row gains an explicit code note: the closure MUST be + `function(text) return broker.token_count(active_cfg, text) end` + — direct upvalue reference. A5 analysis verified upvalue + semantics; now spelled out so the implementer doesn't subtly + miss it. + +R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as + unsupported when llama.cpp is busy serving a concurrent + completion.** llama.cpp is single-threaded for inference; a + /tokenize request that arrives mid-generation queues behind + inference and may exceed the 2s cap. The capability would then + cache as `false` for the rest of the session, even though the + endpoint IS capable. **Fix:** §9 risk row added documenting + this. Mitigation: 2s is reasonable for IDLE responses but if + practical problems surface, bump to 5s or make configurable + (`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if + it bites. + +R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two + same-endpoint/different-model presets.** B1 confirmed + /tokenize ignores the model field, so two probes per session + when one would suffice. **Fix:** §4 cache key SIMPLIFIED to + just `model_cfg.endpoint` (B1-justified). Same-endpoint + presets share one cache entry; one probe per endpoint per + session, not per (endpoint, model). For a future broker that + DOES honor the model field, this design choice would need + revisiting — documented inline. + +R-N1..N4 (NITs, APPLIED): + N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected + to Lua's lowercase `or`/`and`. + N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads + "Baseline (B1)" for consistency. + N3. §6 outdated inline `~est=N` description removed; new approach + (single trailing summary line) is documented; §8 out-of-scope + bullet about per-call comparison stays as the explicit "we + considered, rejected" record. + N4. PHASE8.md status header (formerly carrying a stale tree hash + that would drift before implementation) now references the + latest tree as of this fold-in (`aa64ad3`). Commit 5's status + bump to "Implement" will refresh it again at that point. + **Analyze findings (2026-05-16):** A1. **enforce_budget ONLY checks max_turns, not token_budget — major @@ -195,13 +277,15 @@ No new module files. ## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)` -```lua --- Per-endpoint capability cache (session-scoped local in broker.lua) -local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false +R6-revised — cache key is endpoint-only (B1: /tokenize ignores the +model field so two presets sharing an endpoint share one cache entry): -local function _cache_key(model_cfg) - return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "") -end +```lua +-- Per-endpoint capability cache (session-scoped local in broker.lua). +-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the +-- model field; same endpoint -> same tokenization). If a future +-- broker honors the model field, revisit this keying. +local _tokenize_capable = {} -- [endpoint] = true | false function M.token_count(model_cfg, text) text = text or "" @@ -209,34 +293,34 @@ function M.token_count(model_cfg, text) if not (model_cfg and model_cfg.endpoint) then return math.floor(#text / 4) -- pure fallback end - local key = _cache_key(model_cfg) - local cap = _tokenize_capable[key] + local ep = model_cfg.endpoint + local cap = _tokenize_capable[ep] if cap == false then return math.floor(#text / 4) end -- cap == nil OR cap == true; try the endpoint. - local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize" + local url = ep:gsub("/+$", "") .. "/tokenize" local body = json.encode({ content = text, model = model_cfg.model }) local out, status = curl.post(url, body, { "Content-Type: application/json" }, 2000) -- 2s timeout if not (status == 200 and out) then - _tokenize_capable[key] = false + _tokenize_capable[ep] = false return math.floor(#text / 4) end local doc = json.decode(out) local toks = doc and doc.tokens if type(toks) ~= "table" then - _tokenize_capable[key] = false + _tokenize_capable[ep] = false return math.floor(#text / 4) end - _tokenize_capable[key] = true + _tokenize_capable[ep] = true return #toks end function M.tokenize_supported(model_cfg) - if not model_cfg then return nil end - return _tokenize_capable[_cache_key(model_cfg)] + if not (model_cfg and model_cfg.endpoint) then return nil end + return _tokenize_capable[model_cfg.endpoint] end ``` @@ -246,6 +330,8 @@ Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status). ## 5. Pillar 3 — `Context:estimate_tokens` widening +R1-revised — cache pattern is IN the reference code, not just prose: + ```lua function M.new(opts) ... @@ -261,9 +347,18 @@ end function Context:estimate_tokens() if self.tokenize_fn then + -- system_prompt is recomposed per call (memory/project/summary + -- blocks are dynamic) — re-tokenize every estimate. Bounded + -- by one round-trip. local n = self.tokenize_fn(self.system_prompt) + -- R1: per-turn cache on the turn dict itself. Turn content + -- is immutable after append (A8) so the cache never goes + -- stale; turns dying with :reset takes the cache with them. for _, t in ipairs(self.turns) do - n = n + self.tokenize_fn(t.content) + if t._tokens == nil then + t._tokens = self.tokenize_fn(t.content) + end + n = n + t._tokens end return n end @@ -274,12 +369,11 @@ function Context:estimate_tokens() end ``` -Performance note: with N turns of average ~500 chars each, `estimate_tokens` -fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for -per-prompt eviction checks. **Mitigation**: per-turn token count cached -ON the turn dict (`turn._tokens`) the first time it's counted; only -re-tokenized if `turn._tokens` is nil. Set at append time when -tokenize_fn is present; otherwise lazily on first estimate. +Performance: first call after a fresh session fires N+1 round-trips +(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt) ++ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms +amortized per call — acceptable for the prompt-template render path +AND the per-step Norris enforce_budget call. --- @@ -291,25 +385,27 @@ Current `:cost detail` (Phase 7) shows: anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219 ``` -The `179 / 8` is `prompt_tokens / completion_tokens` from the actual -broker usage payload. +The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all +calls in that slot — including any turns later evicted from context. -Phase 8 extension: for each slot, also compute an estimated count via -`broker.token_count(model_cfg_for_this_slot, ...)` over the -TURNS THAT CONTRIBUTED to this slot. But that's stateful and -expensive — simpler: show the SUM of `prompt_tokens` (actual) and -the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending -on what tokenize_fn is wired). If disagreement >10%, annotate. +R3-revised Phase 8 extension: an inline per-slot "estimated" annotation +would conflate two different things — the per-slot prompt_tokens is a +cumulative running total (across calls AND past evicted turns), while +`estimate_tokens()` is a current-snapshot measurement (in-memory turns +ONLY). Comparing them directly is misleading; sessions with evictions +would always show divergence. -Simplified format: +Instead, add a SINGLE trailing summary line after the slot rows: ``` - anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219 - ^^^^^^^^^^ shown when |actual-est|/actual > 0.10 + ... per-slot rows ... + [estimated session ctx: 412 tokens; token_budget=4096 (10% used)] ``` -The `~est=N` annotation only renders when the disagreement exceeds the -10% threshold. Silent otherwise. +The estimate is `ctx:estimate_tokens()` over the current ctx (system +prompt + live turns); the percentage gives at-a-glance budget +utilization. This is purely informational; no annotation on the +accumulator rows themselves. --- @@ -366,6 +462,7 @@ after one probe). No surprise; document in config example. | `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. | | Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. | | A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. | +| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. | --- @@ -377,7 +474,7 @@ after one probe). No surprise; document in config example. | Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. | | Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. | | Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. | -| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). | +| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. | | Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. | --- @@ -439,37 +536,47 @@ commit feature smoke). turn._tokens. 3. **`context.lua` — enforce_budget honors token_budget (pillar 5).** - - Existing `while #self.turns > self.max_turns` loop extended to: - `while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`. - - Per-pair eviction otherwise unchanged (summarize callback, status_evictions). + - Existing `while #self.turns > self.max_turns` loop extended. + **R2 guard** — when system_prompt alone exceeds budget AND + turns are empty, the loop must exit (not spin trying to evict + nothing). Correct condition: + ```lua + while (#self.turns > self.max_turns + or self:estimate_tokens() > self.token_budget) + and #self.turns > 0 do + ``` + Lowercase `or`/`and` per Lua syntax (N1). + - Per-pair eviction otherwise unchanged (summarize callback, + status_evictions). - The estimate_tokens call inside the loop is potentially expensive under tokenize_fn — but commit #2's per-turn cache means each - iteration is O(#turns) dict-lookups. Acceptable for the eviction - hot path. - - Smoke: Context with `token_budget = 100`, max_turns = 100, fill - with turns until `estimate_tokens() > 100`, then call enforce_budget - — should evict until under budget. With char/4: eviction happens - at known char counts. With tokenize_fn stub returning 50/turn: - evicts down to <= 100/50 = 2 turns. + iteration is O(#turns) dict-lookups after the first. Acceptable + for the eviction hot path. + - Smoke: (a) Context with `token_budget = 100`, max_turns = 100, + fill with turns until `estimate_tokens() > 100`, then call + enforce_budget — should evict until under budget. (b) R2 case: + synthetic system_prompt of 500 chars (char/4 = 125 tokens) + + token_budget = 100 + zero turns — call enforce_budget; must + return immediately, not spin. -4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.** +4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.** - When `config.tokenize and config.tokenize.use_endpoint`, build `ctx_opts.tokenize_fn = function(text) return broker.token_count(active_cfg, text) - end`. The closure captures `active_cfg` upval per A5 — follows - `:model` switches naturally. - - `:cost detail` extension: per slot, compute the sum of actual - `prompt_tokens` (already in usage_totals) AND an "estimated" - count via... actually this needs a rethink: the accumulator - doesn't store per-turn data, just running sums. The estimate - for comparison would need to be over ctx's CURRENT content - (a single snapshot, not per-call). Simplest: show a SECOND - line under `:cost detail` with the current `estimate_tokens()` - value, labeled "[estimated session ctx: N tokens]". Disagreement - with the accumulator's prompt_tokens TOTAL surfaced inline. + end`. **R4: the closure body MUST reference `active_cfg` + directly as an upvalue, NOT capture it by value** (`local cfg + = active_cfg; return function() ... cfg ... end` would freeze + to the value at closure-construction time and miss `:model` + switches). A5 verified upvalue semantics in Lua. + - `:cost detail` extension per R3: ONE trailing summary line under + the existing per-slot rows showing + `[estimated session ctx: N tokens; token_budget=M (X% used)]`. + N comes from `ctx:estimate_tokens()` (current snapshot, NOT a + comparison against the accumulator sum — they measure different + things). M is `ctx.token_budget`. X% = N/M × 100. - Smoke: with use_endpoint=true on a local-only session, observe - enforce_budget eviction timing vs disable; observe :cost detail - estimate row. + enforce_budget eviction timing vs disabled; observe :cost detail + estimate row updates as turns accumulate. 5. **`config.lua` example block + `docs/PHASE8.md` status bump.** - Commented-out `tokenize = { use_endpoint = true }` block in