docs/PHASE8: review fold-in — 2 BLOCKERs + 4 CONCERNs + 4 NITs
Sonnet-reviewed per reviews-use-sonnet memory directive.
BLOCKERs (RESOLVED in-place):
R1. §5 estimate_tokens pseudocode missing per-turn cache pattern.
Prose described it; code block called tokenize_fn unconditionally.
Implementer following code verbatim would hit the O(N round-
trips per call) perf gap the prose flagged. Code block now
shows explicit `if t._tokens then ... else t._tokens = ... end`.
R2. enforce_budget loop can spin forever when system_prompt alone
exceeds token_budget (e.g. 5KB project block + budget=4096 +
zero turns -> turns can't shrink further but OR-condition stays
true). Fix: AND `#self.turns > 0` guard on the loop. §13 commit
3 row shows the explicit Lua-syntax condition.
CONCERNs (FOLDED):
R3. :cost detail per-slot ~est=N annotation was semantically
undefined — accumulator sum (cumulative across calls + evicted
turns) vs current-snapshot estimate are incommensurable. §6
reworked: ONE trailing summary line "[estimated session ctx:
N tokens; token_budget=M (X% used)]" instead of per-slot
annotations. §13 commit 4 aligned.
R4. tokenize_fn closure MUST reference active_cfg as upvalue (NOT
capture by value). Subtle but easy to miss — §13 commit 4 now
spells out the correct vs wrong patterns explicitly.
R5. 2s tokenize timeout can spuriously cache-as-unsupported when
llama.cpp is busy with a concurrent completion (single-threaded
inference; /tokenize queues behind). Documented in §9; v1
ships 2s, revisit during verify if it bites.
R6. Per-endpoint cache key conflated two same-endpoint/different-
model presets (B1: /tokenize ignores the model field). Cache
key simplified to endpoint-only. One probe per endpoint per
session; if a future broker honors the model field, revisit.
NITs (APPLIED):
N1. §13 commit 3 `OR`/`AND` -> Lua-syntax `or`/`and`.
N2. §10 Q-T5 Resolution-target cell filled in (was blank after B1).
N3. §6 / §8 / §13 commit 4 now describe a CONSISTENT approach
(trailing summary line; per-slot annotation dropped).
N4. Status header tree-hash updated to current (aa64ad3 -> stays
fresh through review fold-in; commit 5 will refresh again
at "Implement" status).
PHASE8.md now 622 lines (was 454 after plan). +168/-61. Ready for
implementation phase 6 of the inner loop.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
+167
-60
@@ -2,9 +2,91 @@
|
||||
|
||||
**Project:** aish — AI-augmented conversational shell
|
||||
**Document:** Phase 8 Requirements, Architecture & Design Decisions
|
||||
**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`)
|
||||
**Status:** Plan + review fold-in (tree at `aa64ad3`)
|
||||
**Date:** 2026-05-16
|
||||
|
||||
**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs
|
||||
resolved in-place, 4 CONCERNs folded, 4 NITs applied:**
|
||||
|
||||
R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.**
|
||||
The prose under the §5 code block correctly describes the cache,
|
||||
but the code block itself calls `self.tokenize_fn(t.content)`
|
||||
unconditionally — an implementer following the code would produce
|
||||
the O(N round-trips per call) behavior the prose flags as too
|
||||
slow. **Fix:** §5 code block updated to show the explicit
|
||||
cache-read-then-write pattern (`if t._tokens then ... else
|
||||
t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row
|
||||
also calls this out explicitly.
|
||||
|
||||
R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely
|
||||
when system_prompt alone exceeds token_budget.** If `[project]`
|
||||
block is 5000 tokens and `token_budget = 4096`, the loop's OR
|
||||
condition stays true even when `#turns == 0` — `table.remove`
|
||||
is a no-op, the loop never exits. **Fix:** §13 commit 3 row
|
||||
updated to specify the explicit guard: `while (#self.turns >
|
||||
self.max_turns or self:estimate_tokens() > self.token_budget)
|
||||
and #self.turns > 0 do`. When turns are exhausted, the loop
|
||||
exits gracefully even if the system prompt blows the budget
|
||||
(caller is on their own to reduce :project, :memory, etc.).
|
||||
|
||||
R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically
|
||||
undefined.** Sum-of-prompt_tokens across all calls (accumulator)
|
||||
vs current-snapshot estimate are incommensurable — sessions
|
||||
with evictions ALWAYS show divergence not because heuristic is
|
||||
wrong but because they measure different things. **Resolution:**
|
||||
§6 reworked to drop the per-slot `~est=N` inline annotation
|
||||
(which conflated the two); instead show a SINGLE trailing
|
||||
"[estimated session ctx: N tokens]" line under :cost detail.
|
||||
Cleanly separates the running-total accumulator from the
|
||||
current-snapshot estimate. §13 commit 4 already pointed to this
|
||||
direction — now §6 matches.
|
||||
|
||||
R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg`
|
||||
by upvalue, not by value-capture.** If the implementer writes
|
||||
`local cfg = active_cfg; return function(text) ... cfg ... end`,
|
||||
the closure won't follow `:model` switches. **Fix:** §13 commit
|
||||
4 row gains an explicit code note: the closure MUST be
|
||||
`function(text) return broker.token_count(active_cfg, text) end`
|
||||
— direct upvalue reference. A5 analysis verified upvalue
|
||||
semantics; now spelled out so the implementer doesn't subtly
|
||||
miss it.
|
||||
|
||||
R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as
|
||||
unsupported when llama.cpp is busy serving a concurrent
|
||||
completion.** llama.cpp is single-threaded for inference; a
|
||||
/tokenize request that arrives mid-generation queues behind
|
||||
inference and may exceed the 2s cap. The capability would then
|
||||
cache as `false` for the rest of the session, even though the
|
||||
endpoint IS capable. **Fix:** §9 risk row added documenting
|
||||
this. Mitigation: 2s is reasonable for IDLE responses but if
|
||||
practical problems surface, bump to 5s or make configurable
|
||||
(`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if
|
||||
it bites.
|
||||
|
||||
R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two
|
||||
same-endpoint/different-model presets.** B1 confirmed
|
||||
/tokenize ignores the model field, so two probes per session
|
||||
when one would suffice. **Fix:** §4 cache key SIMPLIFIED to
|
||||
just `model_cfg.endpoint` (B1-justified). Same-endpoint
|
||||
presets share one cache entry; one probe per endpoint per
|
||||
session, not per (endpoint, model). For a future broker that
|
||||
DOES honor the model field, this design choice would need
|
||||
revisiting — documented inline.
|
||||
|
||||
R-N1..N4 (NITs, APPLIED):
|
||||
N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected
|
||||
to Lua's lowercase `or`/`and`.
|
||||
N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
|
||||
"Baseline (B1)" for consistency.
|
||||
N3. §6 outdated inline `~est=N` description removed; new approach
|
||||
(single trailing summary line) is documented; §8 out-of-scope
|
||||
bullet about per-call comparison stays as the explicit "we
|
||||
considered, rejected" record.
|
||||
N4. PHASE8.md status header (formerly carrying a stale tree hash
|
||||
that would drift before implementation) now references the
|
||||
latest tree as of this fold-in (`aa64ad3`). Commit 5's status
|
||||
bump to "Implement" will refresh it again at that point.
|
||||
|
||||
**Analyze findings (2026-05-16):**
|
||||
|
||||
A1. **enforce_budget ONLY checks max_turns, not token_budget — major
|
||||
@@ -195,13 +277,15 @@ No new module files.
|
||||
|
||||
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
|
||||
|
||||
```lua
|
||||
-- Per-endpoint capability cache (session-scoped local in broker.lua)
|
||||
local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false
|
||||
R6-revised — cache key is endpoint-only (B1: /tokenize ignores the
|
||||
model field so two presets sharing an endpoint share one cache entry):
|
||||
|
||||
local function _cache_key(model_cfg)
|
||||
return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
|
||||
end
|
||||
```lua
|
||||
-- Per-endpoint capability cache (session-scoped local in broker.lua).
|
||||
-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
|
||||
-- model field; same endpoint -> same tokenization). If a future
|
||||
-- broker honors the model field, revisit this keying.
|
||||
local _tokenize_capable = {} -- [endpoint] = true | false
|
||||
|
||||
function M.token_count(model_cfg, text)
|
||||
text = text or ""
|
||||
@@ -209,34 +293,34 @@ function M.token_count(model_cfg, text)
|
||||
if not (model_cfg and model_cfg.endpoint) then
|
||||
return math.floor(#text / 4) -- pure fallback
|
||||
end
|
||||
local key = _cache_key(model_cfg)
|
||||
local cap = _tokenize_capable[key]
|
||||
local ep = model_cfg.endpoint
|
||||
local cap = _tokenize_capable[ep]
|
||||
if cap == false then
|
||||
return math.floor(#text / 4)
|
||||
end
|
||||
-- cap == nil OR cap == true; try the endpoint.
|
||||
local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
|
||||
local url = ep:gsub("/+$", "") .. "/tokenize"
|
||||
local body = json.encode({ content = text, model = model_cfg.model })
|
||||
local out, status = curl.post(url, body,
|
||||
{ "Content-Type: application/json" },
|
||||
2000) -- 2s timeout
|
||||
if not (status == 200 and out) then
|
||||
_tokenize_capable[key] = false
|
||||
_tokenize_capable[ep] = false
|
||||
return math.floor(#text / 4)
|
||||
end
|
||||
local doc = json.decode(out)
|
||||
local toks = doc and doc.tokens
|
||||
if type(toks) ~= "table" then
|
||||
_tokenize_capable[key] = false
|
||||
_tokenize_capable[ep] = false
|
||||
return math.floor(#text / 4)
|
||||
end
|
||||
_tokenize_capable[key] = true
|
||||
_tokenize_capable[ep] = true
|
||||
return #toks
|
||||
end
|
||||
|
||||
function M.tokenize_supported(model_cfg)
|
||||
if not model_cfg then return nil end
|
||||
return _tokenize_capable[_cache_key(model_cfg)]
|
||||
if not (model_cfg and model_cfg.endpoint) then return nil end
|
||||
return _tokenize_capable[model_cfg.endpoint]
|
||||
end
|
||||
```
|
||||
|
||||
@@ -246,6 +330,8 @@ Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
|
||||
|
||||
## 5. Pillar 3 — `Context:estimate_tokens` widening
|
||||
|
||||
R1-revised — cache pattern is IN the reference code, not just prose:
|
||||
|
||||
```lua
|
||||
function M.new(opts)
|
||||
...
|
||||
@@ -261,9 +347,18 @@ end
|
||||
|
||||
function Context:estimate_tokens()
|
||||
if self.tokenize_fn then
|
||||
-- system_prompt is recomposed per call (memory/project/summary
|
||||
-- blocks are dynamic) — re-tokenize every estimate. Bounded
|
||||
-- by one round-trip.
|
||||
local n = self.tokenize_fn(self.system_prompt)
|
||||
-- R1: per-turn cache on the turn dict itself. Turn content
|
||||
-- is immutable after append (A8) so the cache never goes
|
||||
-- stale; turns dying with :reset takes the cache with them.
|
||||
for _, t in ipairs(self.turns) do
|
||||
n = n + self.tokenize_fn(t.content)
|
||||
if t._tokens == nil then
|
||||
t._tokens = self.tokenize_fn(t.content)
|
||||
end
|
||||
n = n + t._tokens
|
||||
end
|
||||
return n
|
||||
end
|
||||
@@ -274,12 +369,11 @@ function Context:estimate_tokens()
|
||||
end
|
||||
```
|
||||
|
||||
Performance note: with N turns of average ~500 chars each, `estimate_tokens`
|
||||
fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
|
||||
per-prompt eviction checks. **Mitigation**: per-turn token count cached
|
||||
ON the turn dict (`turn._tokens`) the first time it's counted; only
|
||||
re-tokenized if `turn._tokens` is nil. Set at append time when
|
||||
tokenize_fn is present; otherwise lazily on first estimate.
|
||||
Performance: first call after a fresh session fires N+1 round-trips
|
||||
(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
|
||||
+ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms
|
||||
amortized per call — acceptable for the prompt-template render path
|
||||
AND the per-step Norris enforce_budget call.
|
||||
|
||||
---
|
||||
|
||||
@@ -291,25 +385,27 @@ Current `:cost detail` (Phase 7) shows:
|
||||
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
|
||||
```
|
||||
|
||||
The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
|
||||
broker usage payload.
|
||||
The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all
|
||||
calls in that slot — including any turns later evicted from context.
|
||||
|
||||
Phase 8 extension: for each slot, also compute an estimated count via
|
||||
`broker.token_count(model_cfg_for_this_slot, ...)` over the
|
||||
TURNS THAT CONTRIBUTED to this slot. But that's stateful and
|
||||
expensive — simpler: show the SUM of `prompt_tokens` (actual) and
|
||||
the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
|
||||
on what tokenize_fn is wired). If disagreement >10%, annotate.
|
||||
R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
|
||||
would conflate two different things — the per-slot prompt_tokens is a
|
||||
cumulative running total (across calls AND past evicted turns), while
|
||||
`estimate_tokens()` is a current-snapshot measurement (in-memory turns
|
||||
ONLY). Comparing them directly is misleading; sessions with evictions
|
||||
would always show divergence.
|
||||
|
||||
Simplified format:
|
||||
Instead, add a SINGLE trailing summary line after the slot rows:
|
||||
|
||||
```
|
||||
anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219
|
||||
^^^^^^^^^^ shown when |actual-est|/actual > 0.10
|
||||
... per-slot rows ...
|
||||
[estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
|
||||
```
|
||||
|
||||
The `~est=N` annotation only renders when the disagreement exceeds the
|
||||
10% threshold. Silent otherwise.
|
||||
The estimate is `ctx:estimate_tokens()` over the current ctx (system
|
||||
prompt + live turns); the percentage gives at-a-glance budget
|
||||
utilization. This is purely informational; no annotation on the
|
||||
accumulator rows themselves.
|
||||
|
||||
---
|
||||
|
||||
@@ -366,6 +462,7 @@ after one probe). No surprise; document in config example.
|
||||
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
|
||||
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
|
||||
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
|
||||
| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
|
||||
|
||||
---
|
||||
|
||||
@@ -377,7 +474,7 @@ after one probe). No surprise; document in config example.
|
||||
| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
|
||||
| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
|
||||
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
|
||||
| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). |
|
||||
| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
|
||||
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
|
||||
|
||||
---
|
||||
@@ -439,37 +536,47 @@ commit feature smoke).
|
||||
turn._tokens.
|
||||
|
||||
3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
|
||||
- Existing `while #self.turns > self.max_turns` loop extended to:
|
||||
`while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`.
|
||||
- Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
|
||||
- Existing `while #self.turns > self.max_turns` loop extended.
|
||||
**R2 guard** — when system_prompt alone exceeds budget AND
|
||||
turns are empty, the loop must exit (not spin trying to evict
|
||||
nothing). Correct condition:
|
||||
```lua
|
||||
while (#self.turns > self.max_turns
|
||||
or self:estimate_tokens() > self.token_budget)
|
||||
and #self.turns > 0 do
|
||||
```
|
||||
Lowercase `or`/`and` per Lua syntax (N1).
|
||||
- Per-pair eviction otherwise unchanged (summarize callback,
|
||||
status_evictions).
|
||||
- The estimate_tokens call inside the loop is potentially expensive
|
||||
under tokenize_fn — but commit #2's per-turn cache means each
|
||||
iteration is O(#turns) dict-lookups. Acceptable for the eviction
|
||||
hot path.
|
||||
- Smoke: Context with `token_budget = 100`, max_turns = 100, fill
|
||||
with turns until `estimate_tokens() > 100`, then call enforce_budget
|
||||
— should evict until under budget. With char/4: eviction happens
|
||||
at known char counts. With tokenize_fn stub returning 50/turn:
|
||||
evicts down to <= 100/50 = 2 turns.
|
||||
iteration is O(#turns) dict-lookups after the first. Acceptable
|
||||
for the eviction hot path.
|
||||
- Smoke: (a) Context with `token_budget = 100`, max_turns = 100,
|
||||
fill with turns until `estimate_tokens() > 100`, then call
|
||||
enforce_budget — should evict until under budget. (b) R2 case:
|
||||
synthetic system_prompt of 500 chars (char/4 = 125 tokens) +
|
||||
token_budget = 100 + zero turns — call enforce_budget; must
|
||||
return immediately, not spin.
|
||||
|
||||
4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.**
|
||||
4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.**
|
||||
- When `config.tokenize and config.tokenize.use_endpoint`, build
|
||||
`ctx_opts.tokenize_fn = function(text)
|
||||
return broker.token_count(active_cfg, text)
|
||||
end`. The closure captures `active_cfg` upval per A5 — follows
|
||||
`:model` switches naturally.
|
||||
- `:cost detail` extension: per slot, compute the sum of actual
|
||||
`prompt_tokens` (already in usage_totals) AND an "estimated"
|
||||
count via... actually this needs a rethink: the accumulator
|
||||
doesn't store per-turn data, just running sums. The estimate
|
||||
for comparison would need to be over ctx's CURRENT content
|
||||
(a single snapshot, not per-call). Simplest: show a SECOND
|
||||
line under `:cost detail` with the current `estimate_tokens()`
|
||||
value, labeled "[estimated session ctx: N tokens]". Disagreement
|
||||
with the accumulator's prompt_tokens TOTAL surfaced inline.
|
||||
end`. **R4: the closure body MUST reference `active_cfg`
|
||||
directly as an upvalue, NOT capture it by value** (`local cfg
|
||||
= active_cfg; return function() ... cfg ... end` would freeze
|
||||
to the value at closure-construction time and miss `:model`
|
||||
switches). A5 verified upvalue semantics in Lua.
|
||||
- `:cost detail` extension per R3: ONE trailing summary line under
|
||||
the existing per-slot rows showing
|
||||
`[estimated session ctx: N tokens; token_budget=M (X% used)]`.
|
||||
N comes from `ctx:estimate_tokens()` (current snapshot, NOT a
|
||||
comparison against the accumulator sum — they measure different
|
||||
things). M is `ctx.token_budget`. X% = N/M × 100.
|
||||
- Smoke: with use_endpoint=true on a local-only session, observe
|
||||
enforce_budget eviction timing vs disable; observe :cost detail
|
||||
estimate row.
|
||||
enforce_budget eviction timing vs disabled; observe :cost detail
|
||||
estimate row updates as turns accumulate.
|
||||
|
||||
5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
|
||||
- Commented-out `tokenize = { use_endpoint = true }` block in
|
||||
|
||||
Reference in New Issue
Block a user