# aish — Phase 8 Manifest **Project:** aish — AI-augmented conversational shell **Document:** Phase 8 Requirements, Architecture & Design Decisions **Status:** Analyze (formulate complete; tree at `00869ba` probed) **Date:** 2026-05-16 **Analyze findings (2026-05-16):** A1. **enforce_budget ONLY checks max_turns, not token_budget — major scope gap.** `Context:enforce_budget` (context.lua:319) iterates `while #self.turns > self.max_turns`; `self.token_budget = 4096` is set but NEVER consulted. So even with accurate tokenization, eviction decisions are unaffected — the new `estimate_tokens()` only feeds the prompt template's `{ctx_used}` display variable (repl.lua:630). **Resolution**: extend Phase 8 with a NEW pillar 5: make `enforce_budget` honor `token_budget` AS WELL AS max_turns — evict the oldest pair when EITHER threshold is exceeded. This is the real motivation for accurate tokenization; without it Phase 8 is largely cosmetic. Folded into §1 (5 pillars now), §3 (context.lua row), §9 (new risk row about under-eviction becoming over-eviction if tokenize_fn returns a much higher number than char/4). A2. **`ffi.curl.M.post` signature confirmed.** `(body, status)` on success, `(nil, err)` on failure. Matches the formulate-time sketch. status is the integer HTTP code. The probe checks `status == 200 and out` correctly. A3. **Single caller of `Context:estimate_tokens()` in tree.** Only `repl.lua:630` (prompt template `{ctx_used}` substitution) calls it. No internal callers in context.lua. This means: - The wiring point is ONE line in repl.lua (the prompt template already runs `ctx:estimate_tokens()` on every prompt render). - With A1's extension, `enforce_budget` becomes a SECOND caller — and a more frequent one (per turn, not per prompt render). - Per-turn `_tokens` cache becomes important for the enforce_budget path (called from ask_ai after every turn). A4. **Q-T1 RESOLVED**: per-turn `_tokens` cache lives on the turn dict. `:reset` clears `ctx.turns` so the cache dies with them. New turns get nil `_tokens`; lazy-set on first count. Trivial. A5. **Q-T2 RESOLVED**: tokenize_fn closure captures `active_cfg` as an UPVALUE. Upvalues are resolved at closure call time, not at definition time. When the user `:model cloud` switches, `active_cfg = config.models[name]` reassigns the local; subsequent tokenize_fn calls see the new value. Natural; no explicit re-binding needed. A6. **Q-T3 RESOLVED**: skip the probe entirely when `cfg.tokenize.use_endpoint = false` or unset. Don't even call `broker.token_count` — repl.lua won't wire `tokenize_fn` to Context.new in the first place. context.lua's tokenize_fn-nil branch handles it (char/4 fallback). A7. **Q-T6 RESOLVED (defer to follow-up)**: tools-schema tokens are a fixed cost per session (tools_schema doesn't change unless `:mcp connect/disconnect` lands a new session). The under-count is bounded and predictable. Defer to a future polish; v1 counts only messages. Document in §8 out-of-scope. A8. **Per-turn `_tokens` cache invalidation.** Turn `content` is immutable after append (we don't mutate stored turns). Cache is safe to live forever on the turn. The only invalidation event is `:reset` (clears turns wholesale). No other invalidation needed. A9. **Probe latency baseline** (Q-T4 deferred): probed manually during formulate — single tokenize call for ~50 char text ran in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms = 2s ONLY on the first estimate after a fresh session. After caching, subsequent estimates are O(1) per turn (dict lookups). A10. **Streaming during `chat_stream` interleaves with tokenize?** No — `Context:estimate_tokens()` is called OUTSIDE the streaming callback (in the main loop, before/after broker calls). No concurrent network competition. A11. **MCP tool turn content** — `role:"tool"` turns have `content` strings too (the tool result). These get tokenized identically; no special-case needed. Cache key is the turn dict itself, so tool turns get their own `_tokens` slot. A12. **`include_usage` interaction with tokenize**: orthogonal. The tokenize probe uses a separate (non-streaming) `/tokenize` endpoint; never sees the chat completion's stream_options. PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest specifies what Phase 8 adds — **accurate tokenization**: replace the char/4 heuristic on `Context:estimate_tokens()` with a per-broker `/tokenize` round-trip where supported, char/4 fallback otherwise. Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred forward across each phase). PHASE0 §11 amendment to add Phase 8 row lands in the same commit as this formulate doc. --- ## 1. Scope of Phase 8 Five pillars (A1 added pillar 5): 1. **Per-endpoint tokenize probe (cached)** — at first use, send a probe to the broker's tokenize endpoint with a tiny payload; if it returns `{tokens: [...]}` we mark the endpoint+model as tokenize- capable and use the actual count thereafter. If it 404s or errors, mark the slot as `tokenize_supported = false` and fall through to char/4 silently. Cached per `(endpoint, model)` for the session. 2. **`broker.token_count(model_cfg, text)`** — thin wrapper that returns an accurate token count when the (endpoint, model) is tokenize-capable, else the char/4 heuristic. Always returns a non-negative integer; never errors. The probe + fallback is transparent to callers. 3. **`Context:estimate_tokens()` widening** — currently char/4 over `system_prompt` + sum of `turn.content`s. The new shape accepts an optional `tokenize_fn` (callback) at `Context.new` time and uses it when present; falls back to char/4 when nil. `repl.lua` wires `tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`. This means the active model's tokenizer is used for budgeting decisions, which matches the broker the next ask_ai will hit. 4. **`:cost detail` estimated-vs-actual column** — for each (model, category) slot in the accumulator, the actual `prompt_tokens` from broker usage is already stored. Add an estimated column computed via `broker.token_count` on the currently-buffered prompt-shape. Disagreement >10% surfaces in a tiny `~est=N` annotation so users can see when the heuristic diverges from reality. Display-only; no behavior change. 5. **`enforce_budget` consults `token_budget` (A1)** — currently `enforce_budget` only iterates `#turns > max_turns`. Extend to ALSO check `estimate_tokens() > token_budget`. Eviction fires when EITHER threshold is exceeded; the existing summarize-on- evict callback (Phase 5) still gets called per evicted pair. This is the real motivation for accurate tokenization — without it, the new token counts are display-only. Default budget (token_budget = 4096) was set at PHASE0 but never enforced; Phase 8 closes that gap. **Phase 8 is done when:** - A long-running session with the local `qwen-coder-7b-snappy-8k` model evicts at the RIGHT moment (token_budget=4096 hit triggers eviction via the new pillar 5 path) rather than only when max_turns is exceeded. - `broker.token_count(local_cfg, "hello world")` returns 2 (matches the live tokenize result, not the char/4=2 coincidence — verify via `:cost detail` against multi-paragraph text). - `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4 fallback when /tokenize 404s, which it does for OpenRouter). - Cached per-endpoint capability — the probe fires once per endpoint per session, not per call. - Existing configs without `cfg.tokenize` behave like Phase 7 (zero behavior change unless opted in via `cfg.tokenize.use_endpoint = true`). - `:cost detail` shows estimated-vs-actual where disagreement >10%, silent otherwise. --- ## 2. Technology Decisions (delta from Phase 7) | Decision | Choice | Rationale | |---|---|---| | Tokenize endpoint path | `/tokenize` (NOT `/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. | | Request body shape | `{"content": "", "model": ""}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. | | Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. | | Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. | | `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. | | Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). | | Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. | | Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. | | `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. | | New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. | --- ## 3. Module Changes | File | State after Phase 7 | Phase 8 changes | |---|---|---| | `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. | | `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents; `enforce_budget()` only checks max_turns | Widen `estimate_tokens` to use `self.tokenize_fn(text)` if present; else char/4. Per-turn `_tokens` cache on each turn dict; lazy-set on first count. Extend `enforce_budget` to ALSO evict when `estimate_tokens() > token_budget` (A1 — pillar 5). | | `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. | | `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. | | `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. | No new module files. --- ## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)` ```lua -- Per-endpoint capability cache (session-scoped local in broker.lua) local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false local function _cache_key(model_cfg) return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "") end function M.token_count(model_cfg, text) text = text or "" if text == "" then return 0 end if not (model_cfg and model_cfg.endpoint) then return math.floor(#text / 4) -- pure fallback end local key = _cache_key(model_cfg) local cap = _tokenize_capable[key] if cap == false then return math.floor(#text / 4) end -- cap == nil OR cap == true; try the endpoint. local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize" local body = json.encode({ content = text, model = model_cfg.model }) local out, status = curl.post(url, body, { "Content-Type: application/json" }, 2000) -- 2s timeout if not (status == 200 and out) then _tokenize_capable[key] = false return math.floor(#text / 4) end local doc = json.decode(out) local toks = doc and doc.tokens if type(toks) ~= "table" then _tokenize_capable[key] = false return math.floor(#text / 4) end _tokenize_capable[key] = true return #toks end function M.tokenize_supported(model_cfg) if not model_cfg then return nil end return _tokenize_capable[_cache_key(model_cfg)] end ``` Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status). --- ## 5. Pillar 3 — `Context:estimate_tokens` widening ```lua function M.new(opts) ... return setmetatable({ ... -- Phase 8: optional callback that returns an accurate token -- count for a given text. Set by repl.lua when cfg.tokenize. -- use_endpoint=true, calling broker.token_count(active_cfg, ...). -- nil = char/4 fallback (Phase 0 §8 behavior). tokenize_fn = opts.tokenize_fn, }, Context) end function Context:estimate_tokens() if self.tokenize_fn then local n = self.tokenize_fn(self.system_prompt) for _, t in ipairs(self.turns) do n = n + self.tokenize_fn(t.content) end return n end -- char/4 fallback (existing behavior) local n = #self.system_prompt for _, t in ipairs(self.turns) do n = n + #t.content end return math.floor(n / 4) end ``` Performance note: with N turns of average ~500 chars each, `estimate_tokens` fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for per-prompt eviction checks. **Mitigation**: per-turn token count cached ON the turn dict (`turn._tokens`) the first time it's counted; only re-tokenized if `turn._tokens` is nil. Set at append time when tokenize_fn is present; otherwise lazily on first estimate. --- ## 6. Pillar 4 — `:cost detail` est-vs-actual Current `:cost detail` (Phase 7) shows: ``` anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219 ``` The `179 / 8` is `prompt_tokens / completion_tokens` from the actual broker usage payload. Phase 8 extension: for each slot, also compute an estimated count via `broker.token_count(model_cfg_for_this_slot, ...)` over the TURNS THAT CONTRIBUTED to this slot. But that's stateful and expensive — simpler: show the SUM of `prompt_tokens` (actual) and the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending on what tokenize_fn is wired). If disagreement >10%, annotate. Simplified format: ``` anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219 ^^^^^^^^^^ shown when |actual-est|/actual > 0.10 ``` The `~est=N` annotation only renders when the disagreement exceeds the 10% threshold. Silent otherwise. --- ## 7. UX Surface Summary | Meta | Behavior change | |---|---| | `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% | | (no new metas in v1) | | | Config | Default | Effect | |---|---|---| | `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts | The `cfg.tokenize` block being opt-in is conservative: enabling it means every `Context:estimate_tokens()` call may hit the broker. For local llama.cpp the cost is ~50ms; for cloud-only configurations there IS no /tokenize endpoint so we silently fall through to char/4 (cached after one probe). No surprise; document in config example. --- ## 8. Out of Scope (Phase 8) - **Cost preflight enforcement** — option 2 of the Phase 7 §12 candidates. The tokenize work here is a PREREQUISITE for accurate preflight cost estimation, but the enforcement layer itself (cap_at_dollars that REFUSES the call) is its own surface — defer to a separate phase. - **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates. Independent of tokenization. - **Streaming tokenize** — some servers expose streaming tokenize endpoints for partial-prompt token counts during generation. Out of scope here; we use the blocking /tokenize for batch estimates. - **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat, sentencepiece for HuggingFace) — would require vendoring a C library (violates PHASE0 §3) or shelling out to python. Endpoint-based is the only substrate-compliant option for accuracy beyond char/4. - **Tokenization for `:cost detail` rows that span multiple turns** — the actual `prompt_tokens` in the accumulator slot is the sum ACROSS calls; the estimate for comparison should be over the CURRENT ctx content. Show the per-call comparison only. --- ## 9. Risks | Risk | Mitigation | |---|---| | `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. | | Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. | | Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). | | Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. | | `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. | | Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. | | A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. | --- ## 10. Open Questions (Phase 8) | # | Question | Impact | Resolution target | |---|---|---|---| | Q-T1 | Per-turn `_tokens` cache across `:reset` | A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. | | Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. | | Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. | | Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. | | Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). | | Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. | --- ## 11. Phase 8 → Phase 9+ Out-of-band Candidate follow-ups (non-binding): - **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) — uses Phase 8's accurate token counts to refuse calls that would cross `cap_at_dollars`. The accuracy work here is the foundation. - **Cross-session cost rollup** (Phase 7 §12 option 1) — independent; could land in parallel. - **Phase X**: project-local config overlay (`.aish.lua`) — was the alternative scope to Phase 7's cost work. Still valuable but independent of any current line. Phase 8 itself is self-contained — no upstream dependencies.