docs/PHASE8: review fold-in — 2 BLOCKERs + 4 CONCERNs + 4 NITs

Sonnet-reviewed per reviews-use-sonnet memory directive.

BLOCKERs (RESOLVED in-place):

R1. §5 estimate_tokens pseudocode missing per-turn cache pattern.
    Prose described it; code block called tokenize_fn unconditionally.
    Implementer following code verbatim would hit the O(N round-
    trips per call) perf gap the prose flagged. Code block now
    shows explicit `if t._tokens then ... else t._tokens = ... end`.

R2. enforce_budget loop can spin forever when system_prompt alone
    exceeds token_budget (e.g. 5KB project block + budget=4096 +
    zero turns -> turns can't shrink further but OR-condition stays
    true). Fix: AND `#self.turns > 0` guard on the loop. §13 commit
    3 row shows the explicit Lua-syntax condition.

CONCERNs (FOLDED):

R3. :cost detail per-slot ~est=N annotation was semantically
    undefined — accumulator sum (cumulative across calls + evicted
    turns) vs current-snapshot estimate are incommensurable. §6
    reworked: ONE trailing summary line "[estimated session ctx:
    N tokens; token_budget=M (X% used)]" instead of per-slot
    annotations. §13 commit 4 aligned.

R4. tokenize_fn closure MUST reference active_cfg as upvalue (NOT
    capture by value). Subtle but easy to miss — §13 commit 4 now
    spells out the correct vs wrong patterns explicitly.

R5. 2s tokenize timeout can spuriously cache-as-unsupported when
    llama.cpp is busy with a concurrent completion (single-threaded
    inference; /tokenize queues behind). Documented in §9; v1
    ships 2s, revisit during verify if it bites.

R6. Per-endpoint cache key conflated two same-endpoint/different-
    model presets (B1: /tokenize ignores the model field). Cache
    key simplified to endpoint-only. One probe per endpoint per
    session; if a future broker honors the model field, revisit.

NITs (APPLIED):

N1. §13 commit 3 `OR`/`AND` -> Lua-syntax `or`/`and`.
N2. §10 Q-T5 Resolution-target cell filled in (was blank after B1).
N3. §6 / §8 / §13 commit 4 now describe a CONSISTENT approach
    (trailing summary line; per-slot annotation dropped).
N4. Status header tree-hash updated to current (aa64ad3 -> stays
    fresh through review fold-in; commit 5 will refresh again
    at "Implement" status).

PHASE8.md now 622 lines (was 454 after plan). +168/-61. Ready for
implementation phase 6 of the inner loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-16 23:28:27 +00:00
parent aa64ad3eec
commit 467e573d24
+167 -60
View File
@@ -2,9 +2,91 @@
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 8 Requirements, Architecture & Design Decisions
**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`)
**Status:** Plan + review fold-in (tree at `aa64ad3`)
**Date:** 2026-05-16
**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs
resolved in-place, 4 CONCERNs folded, 4 NITs applied:**
R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.**
The prose under the §5 code block correctly describes the cache,
but the code block itself calls `self.tokenize_fn(t.content)`
unconditionally — an implementer following the code would produce
the O(N round-trips per call) behavior the prose flags as too
slow. **Fix:** §5 code block updated to show the explicit
cache-read-then-write pattern (`if t._tokens then ... else
t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row
also calls this out explicitly.
R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely
when system_prompt alone exceeds token_budget.** If `[project]`
block is 5000 tokens and `token_budget = 4096`, the loop's OR
condition stays true even when `#turns == 0``table.remove`
is a no-op, the loop never exits. **Fix:** §13 commit 3 row
updated to specify the explicit guard: `while (#self.turns >
self.max_turns or self:estimate_tokens() > self.token_budget)
and #self.turns > 0 do`. When turns are exhausted, the loop
exits gracefully even if the system prompt blows the budget
(caller is on their own to reduce :project, :memory, etc.).
R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically
undefined.** Sum-of-prompt_tokens across all calls (accumulator)
vs current-snapshot estimate are incommensurable — sessions
with evictions ALWAYS show divergence not because heuristic is
wrong but because they measure different things. **Resolution:**
§6 reworked to drop the per-slot `~est=N` inline annotation
(which conflated the two); instead show a SINGLE trailing
"[estimated session ctx: N tokens]" line under :cost detail.
Cleanly separates the running-total accumulator from the
current-snapshot estimate. §13 commit 4 already pointed to this
direction — now §6 matches.
R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg`
by upvalue, not by value-capture.** If the implementer writes
`local cfg = active_cfg; return function(text) ... cfg ... end`,
the closure won't follow `:model` switches. **Fix:** §13 commit
4 row gains an explicit code note: the closure MUST be
`function(text) return broker.token_count(active_cfg, text) end`
— direct upvalue reference. A5 analysis verified upvalue
semantics; now spelled out so the implementer doesn't subtly
miss it.
R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as
unsupported when llama.cpp is busy serving a concurrent
completion.** llama.cpp is single-threaded for inference; a
/tokenize request that arrives mid-generation queues behind
inference and may exceed the 2s cap. The capability would then
cache as `false` for the rest of the session, even though the
endpoint IS capable. **Fix:** §9 risk row added documenting
this. Mitigation: 2s is reasonable for IDLE responses but if
practical problems surface, bump to 5s or make configurable
(`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if
it bites.
R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two
same-endpoint/different-model presets.** B1 confirmed
/tokenize ignores the model field, so two probes per session
when one would suffice. **Fix:** §4 cache key SIMPLIFIED to
just `model_cfg.endpoint` (B1-justified). Same-endpoint
presets share one cache entry; one probe per endpoint per
session, not per (endpoint, model). For a future broker that
DOES honor the model field, this design choice would need
revisiting — documented inline.
R-N1..N4 (NITs, APPLIED):
N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected
to Lua's lowercase `or`/`and`.
N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
"Baseline (B1)" for consistency.
N3. §6 outdated inline `~est=N` description removed; new approach
(single trailing summary line) is documented; §8 out-of-scope
bullet about per-call comparison stays as the explicit "we
considered, rejected" record.
N4. PHASE8.md status header (formerly carrying a stale tree hash
that would drift before implementation) now references the
latest tree as of this fold-in (`aa64ad3`). Commit 5's status
bump to "Implement" will refresh it again at that point.
**Analyze findings (2026-05-16):**
A1. **enforce_budget ONLY checks max_turns, not token_budget — major
@@ -195,13 +277,15 @@ No new module files.
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
```lua
-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false
R6-revised — cache key is endpoint-only (B1: /tokenize ignores the
model field so two presets sharing an endpoint share one cache entry):
local function _cache_key(model_cfg)
return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end
```lua
-- Per-endpoint capability cache (session-scoped local in broker.lua).
-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
-- model field; same endpoint -> same tokenization). If a future
-- broker honors the model field, revisit this keying.
local _tokenize_capable = {} -- [endpoint] = true | false
function M.token_count(model_cfg, text)
text = text or ""
@@ -209,34 +293,34 @@ function M.token_count(model_cfg, text)
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local key = _cache_key(model_cfg)
local cap = _tokenize_capable[key]
local ep = model_cfg.endpoint
local cap = _tokenize_capable[ep]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
local url = ep:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[key] = false
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[key] = false
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
_tokenize_capable[key] = true
_tokenize_capable[ep] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not model_cfg then return nil end
return _tokenize_capable[_cache_key(model_cfg)]
if not (model_cfg and model_cfg.endpoint) then return nil end
return _tokenize_capable[model_cfg.endpoint]
end
```
@@ -246,6 +330,8 @@ Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
## 5. Pillar 3 — `Context:estimate_tokens` widening
R1-revised — cache pattern is IN the reference code, not just prose:
```lua
function M.new(opts)
...
@@ -261,9 +347,18 @@ end
function Context:estimate_tokens()
if self.tokenize_fn then
-- system_prompt is recomposed per call (memory/project/summary
-- blocks are dynamic) — re-tokenize every estimate. Bounded
-- by one round-trip.
local n = self.tokenize_fn(self.system_prompt)
-- R1: per-turn cache on the turn dict itself. Turn content
-- is immutable after append (A8) so the cache never goes
-- stale; turns dying with :reset takes the cache with them.
for _, t in ipairs(self.turns) do
n = n + self.tokenize_fn(t.content)
if t._tokens == nil then
t._tokens = self.tokenize_fn(t.content)
end
n = n + t._tokens
end
return n
end
@@ -274,12 +369,11 @@ function Context:estimate_tokens()
end
```
Performance note: with N turns of average ~500 chars each, `estimate_tokens`
fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
per-prompt eviction checks. **Mitigation**: per-turn token count cached
ON the turn dict (`turn._tokens`) the first time it's counted; only
re-tokenized if `turn._tokens` is nil. Set at append time when
tokenize_fn is present; otherwise lazily on first estimate.
Performance: first call after a fresh session fires N+1 round-trips
(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
+ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms
amortized per call — acceptable for the prompt-template render path
AND the per-step Norris enforce_budget call.
---
@@ -291,25 +385,27 @@ Current `:cost detail` (Phase 7) shows:
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
```
The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
broker usage payload.
The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all
calls in that slot — including any turns later evicted from context.
Phase 8 extension: for each slot, also compute an estimated count via
`broker.token_count(model_cfg_for_this_slot, ...)` over the
TURNS THAT CONTRIBUTED to this slot. But that's stateful and
expensive — simpler: show the SUM of `prompt_tokens` (actual) and
the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
on what tokenize_fn is wired). If disagreement >10%, annotate.
R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
would conflate two different things — the per-slot prompt_tokens is a
cumulative running total (across calls AND past evicted turns), while
`estimate_tokens()` is a current-snapshot measurement (in-memory turns
ONLY). Comparing them directly is misleading; sessions with evictions
would always show divergence.
Simplified format:
Instead, add a SINGLE trailing summary line after the slot rows:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219
^^^^^^^^^^ shown when |actual-est|/actual > 0.10
... per-slot rows ...
[estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
```
The `~est=N` annotation only renders when the disagreement exceeds the
10% threshold. Silent otherwise.
The estimate is `ctx:estimate_tokens()` over the current ctx (system
prompt + live turns); the percentage gives at-a-glance budget
utilization. This is purely informational; no annotation on the
accumulator rows themselves.
---
@@ -366,6 +462,7 @@ after one probe). No surprise; document in config example.
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
---
@@ -377,7 +474,7 @@ after one probe). No surprise; document in config example.
| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). |
| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
---
@@ -439,37 +536,47 @@ commit feature smoke).
turn._tokens.
3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
- Existing `while #self.turns > self.max_turns` loop extended to:
`while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`.
- Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
- Existing `while #self.turns > self.max_turns` loop extended.
**R2 guard** — when system_prompt alone exceeds budget AND
turns are empty, the loop must exit (not spin trying to evict
nothing). Correct condition:
```lua
while (#self.turns > self.max_turns
or self:estimate_tokens() > self.token_budget)
and #self.turns > 0 do
```
Lowercase `or`/`and` per Lua syntax (N1).
- Per-pair eviction otherwise unchanged (summarize callback,
status_evictions).
- The estimate_tokens call inside the loop is potentially expensive
under tokenize_fn — but commit #2's per-turn cache means each
iteration is O(#turns) dict-lookups. Acceptable for the eviction
hot path.
- Smoke: Context with `token_budget = 100`, max_turns = 100, fill
with turns until `estimate_tokens() > 100`, then call enforce_budget
— should evict until under budget. With char/4: eviction happens
at known char counts. With tokenize_fn stub returning 50/turn:
evicts down to <= 100/50 = 2 turns.
iteration is O(#turns) dict-lookups after the first. Acceptable
for the eviction hot path.
- Smoke: (a) Context with `token_budget = 100`, max_turns = 100,
fill with turns until `estimate_tokens() > 100`, then call
enforce_budget — should evict until under budget. (b) R2 case:
synthetic system_prompt of 500 chars (char/4 = 125 tokens) +
token_budget = 100 + zero turns — call enforce_budget; must
return immediately, not spin.
4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.**
4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.**
- When `config.tokenize and config.tokenize.use_endpoint`, build
`ctx_opts.tokenize_fn = function(text)
return broker.token_count(active_cfg, text)
end`. The closure captures `active_cfg` upval per A5 — follows
`:model` switches naturally.
- `:cost detail` extension: per slot, compute the sum of actual
`prompt_tokens` (already in usage_totals) AND an "estimated"
count via... actually this needs a rethink: the accumulator
doesn't store per-turn data, just running sums. The estimate
for comparison would need to be over ctx's CURRENT content
(a single snapshot, not per-call). Simplest: show a SECOND
line under `:cost detail` with the current `estimate_tokens()`
value, labeled "[estimated session ctx: N tokens]". Disagreement
with the accumulator's prompt_tokens TOTAL surfaced inline.
end`. **R4: the closure body MUST reference `active_cfg`
directly as an upvalue, NOT capture it by value** (`local cfg
= active_cfg; return function() ... cfg ... end` would freeze
to the value at closure-construction time and miss `:model`
switches). A5 verified upvalue semantics in Lua.
- `:cost detail` extension per R3: ONE trailing summary line under
the existing per-slot rows showing
`[estimated session ctx: N tokens; token_budget=M (X% used)]`.
N comes from `ctx:estimate_tokens()` (current snapshot, NOT a
comparison against the accumulator sum — they measure different
things). M is `ctx.token_budget`. X% = N/M × 100.
- Smoke: with use_endpoint=true on a local-only session, observe
enforce_budget eviction timing vs disable; observe :cost detail
estimate row.
enforce_budget eviction timing vs disabled; observe :cost detail
estimate row updates as turns accumulate.
5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
- Commented-out `tokenize = { use_endpoint = true }` block in