docs/PHASE8: formulate — accurate tokenization (resolves Q1)

Phase 8 formulate manifest + PHASE0 §11 amendment to add the Phase 8 row (substrate amendment per CLAUDE.md §3 lands same commit). Four pillars: 1. Per-endpoint /tokenize probe (cached). One round-trip on first call per (endpoint, model); capability cached for session. hossenfelder + llama.cpp expose <endpoint>/tokenize (NOT /v1/ tokenize — per real probe; the path is endpoint-local, not under the OpenAI /v1 prefix). Cloud (OpenRouter) 404s — silent char/4 fallback. 2. broker.token_count(model_cfg, text) — thin wrapper; tries probe, falls back to char/4 on miss. Always returns non-negative int; never errors. 2s tight timeout; failures cache as not-supported. 3. Context:estimate_tokens widened. Accepts optional tokenize_fn at Context.new; uses it when present, char/4 otherwise. repl.lua wires `tokenize_fn = function(text) return broker.token_count( active_cfg, text) end` when cfg.tokenize.use_endpoint = true. Per-turn _tokens cache to amortize across estimate calls. 4. :cost detail est-vs-actual annotation. When the heuristic disagrees with the actual prompt_tokens from broker usage by >10%, show `~est=N`. Silent otherwise. Display-only; no behavior change. Resolves Q1 (PHASE0 §13, originally Phase 3) — replace char/4 heuristic on Context:estimate_tokens. Originally targeted at Phase 3 but deferred forward each iteration; now lands. Baseline already observed during formulate: - /v1/tokenize -> 404 on hossenfelder; /tokenize -> works - Body shape: {content: "..."} returns {tokens: [N1, N2, ...]} - Accuracy gap: char/4 UNDERESTIMATES by ~10% on real code/prose (508 vs 558 on a 2KB README sample). Material for context- budget eviction decisions. Doc covers scope + done-when, tech decisions table, module changes, per-pillar deep dives, UX surface, out of scope, 6 risk rows, 6 open questions (Q-T4/T5 baseline-bound, others analyze-bound). Scope confirmed via AskUserQuestion: tokenization (chosen over cross-session cost persistence and hard rate-limit enforcement). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:19:53 +00:00
parent 1f34b6dce8
commit 00869ba412
2 changed files with 304 additions and 0 deletions
@@ -317,6 +317,7 @@ from somewhere else.
 | **5** | Multi-model routing by task type, cloud fallback, context summarization via fast model on eviction |
 | **6** | Tree-sitter syntax highlighting hooks, diff-aware code injection, project-level context (file tree summary) |
 | **7** | Cost / usage observability: broker captures `usage` + `cost`; per-session accumulator on ctx; `:cost` reporter; optional warn thresholds |
+| **8** | Accurate tokenization: per-endpoint `/tokenize` probe (cached); `broker.token_count`; `Context:estimate_tokens` widened; `:cost detail` est-vs-actual annotation |

 ---

@@ -0,0 +1,303 @@
+# aish — Phase 8 Manifest
+
+**Project:** aish — AI-augmented conversational shell
+**Document:** Phase 8 Requirements, Architecture & Design Decisions
+**Status:** Formulate (pre-analyze)
+**Date:** 2026-05-16
+
+PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
+specifies what Phase 8 adds — **accurate tokenization**: replace the
+char/4 heuristic on `Context:estimate_tokens()` with a per-broker
+`/tokenize` round-trip where supported, char/4 fallback otherwise.
+
+Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred
+forward across each phase). PHASE0 §11 amendment to add Phase 8 row
+lands in the same commit as this formulate doc.
+
+---
+
+## 1. Scope of Phase 8
+
+Four pillars:
+
+1. **Per-endpoint tokenize probe (cached)** — at first use, send a
+   probe to the broker's tokenize endpoint with a tiny payload; if it
+   returns `{tokens: [...]}` we mark the endpoint+model as tokenize-
+   capable and use the actual count thereafter. If it 404s or errors,
+   mark the slot as `tokenize_supported = false` and fall through to
+   char/4 silently. Cached per `(endpoint, model)` for the session.
+
+2. **`broker.token_count(model_cfg, text)`** — thin wrapper that
+   returns an accurate token count when the (endpoint, model) is
+   tokenize-capable, else the char/4 heuristic. Always returns a
+   non-negative integer; never errors. The probe + fallback is
+   transparent to callers.
+
+3. **`Context:estimate_tokens()` widening** — currently char/4 over
+   `system_prompt` + sum of `turn.content`s. The new shape accepts
+   an optional `tokenize_fn` (callback) at `Context.new` time and uses
+   it when present; falls back to char/4 when nil. `repl.lua` wires
+   `tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`.
+   This means the active model's tokenizer is used for budgeting
+   decisions, which matches the broker the next ask_ai will hit.
+
+4. **`:cost detail` estimated-vs-actual column** — for each
+   (model, category) slot in the accumulator, the actual
+   `prompt_tokens` from broker usage is already stored. Add an
+   estimated column computed via `broker.token_count` on the
+   currently-buffered prompt-shape. Disagreement >10% surfaces in a
+   tiny `~est=N` annotation so users can see when the heuristic
+   diverges from reality. Display-only; no behavior change.
+
+**Phase 8 is done when:**
+
+- A long-running session with the local `qwen-coder-7b-snappy-8k`
+  model evicts at the RIGHT moment (actual context fills the
+  budget) rather than ~10% late per the baseline gap.
+- `broker.token_count(local_cfg, "hello world")` returns 2 (matches
+  the live tokenize result, not the char/4=2 coincidence — verify
+  via `:cost detail` against multi-paragraph text).
+- `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4
+  fallback when /tokenize 404s, which it does for OpenRouter).
+- Cached per-endpoint capability — the probe fires once per
+  endpoint per session, not per call.
+- Existing configs without `cfg.tokenize` behave like Phase 7 (zero
+  behavior change unless opted in via `cfg.tokenize.use_endpoint = true`).
+- `:cost detail` shows estimated-vs-actual where disagreement >10%,
+  silent otherwise.
+
+---
+
+## 2. Technology Decisions (delta from Phase 7)
+
+| Decision | Choice | Rationale |
+|---|---|---|
+| Tokenize endpoint path | `<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. |
+| Request body shape | `{"content": "<text>", "model": "<model>"}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. |
+| Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
+| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
+| `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
+| Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
+| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
+| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
+| `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. |
+| New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |
+
+---
+
+## 3. Module Changes
+
+| File | State after Phase 7 | Phase 8 changes |
+|---|---|---|
+| `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
+| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents | Widen to use `self.tokenize_fn(text)` if present; else char/4. New `tokenize_fn` field on Context (set at Context.new from opts). |
+| `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
+| `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
+| `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
+
+No new module files.
+
+---
+
+## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
+
+```lua
+-- Per-endpoint capability cache (session-scoped local in broker.lua)
+local _tokenize_capable = {}    -- [endpoint .. "/" .. model] = true | false
+
+local function _cache_key(model_cfg)
+    return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
+end
+
+function M.token_count(model_cfg, text)
+    text = text or ""
+    if text == "" then return 0 end
+    if not (model_cfg and model_cfg.endpoint) then
+        return math.floor(#text / 4)   -- pure fallback
+    end
+    local key = _cache_key(model_cfg)
+    local cap = _tokenize_capable[key]
+    if cap == false then
+        return math.floor(#text / 4)
+    end
+    -- cap == nil OR cap == true; try the endpoint.
+    local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
+    local body = json.encode({ content = text, model = model_cfg.model })
+    local out, status = curl.post(url, body,
+        { "Content-Type: application/json" },
+        2000)  -- 2s timeout
+    if not (status == 200 and out) then
+        _tokenize_capable[key] = false
+        return math.floor(#text / 4)
+    end
+    local doc = json.decode(out)
+    local toks = doc and doc.tokens
+    if type(toks) ~= "table" then
+        _tokenize_capable[key] = false
+        return math.floor(#text / 4)
+    end
+    _tokenize_capable[key] = true
+    return #toks
+end
+
+function M.tokenize_supported(model_cfg)
+    if not model_cfg then return nil end
+    return _tokenize_capable[_cache_key(model_cfg)]
+end
+```
+
+Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
+
+---
+
+## 5. Pillar 3 — `Context:estimate_tokens` widening
+
+```lua
+function M.new(opts)
+    ...
+    return setmetatable({
+        ...
+        -- Phase 8: optional callback that returns an accurate token
+        -- count for a given text. Set by repl.lua when cfg.tokenize.
+        -- use_endpoint=true, calling broker.token_count(active_cfg, ...).
+        -- nil = char/4 fallback (Phase 0 §8 behavior).
+        tokenize_fn          = opts.tokenize_fn,
+    }, Context)
+end
+
+function Context:estimate_tokens()
+    if self.tokenize_fn then
+        local n = self.tokenize_fn(self.system_prompt)
+        for _, t in ipairs(self.turns) do
+            n = n + self.tokenize_fn(t.content)
+        end
+        return n
+    end
+    -- char/4 fallback (existing behavior)
+    local n = #self.system_prompt
+    for _, t in ipairs(self.turns) do n = n + #t.content end
+    return math.floor(n / 4)
+end
+```
+
+Performance note: with N turns of average ~500 chars each, `estimate_tokens`
+fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
+per-prompt eviction checks. **Mitigation**: per-turn token count cached
+ON the turn dict (`turn._tokens`) the first time it's counted; only
+re-tokenized if `turn._tokens` is nil. Set at append time when
+tokenize_fn is present; otherwise lazily on first estimate.
+
+---
+
+## 6. Pillar 4 — `:cost detail` est-vs-actual
+
+Current `:cost detail` (Phase 7) shows:
+
+```
+  anthropic/claude-haiku-4.5 main                 1 calls,    179 /      8 tokens, $0.000219
+```
+
+The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
+broker usage payload.
+
+Phase 8 extension: for each slot, also compute an estimated count via
+`broker.token_count(model_cfg_for_this_slot, ...)` over the
+TURNS THAT CONTRIBUTED to this slot. But that's stateful and
+expensive — simpler: show the SUM of `prompt_tokens` (actual) and
+the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
+on what tokenize_fn is wired). If disagreement >10%, annotate.
+
+Simplified format:
+
+```
+  anthropic/claude-haiku-4.5 main                 1 calls,    179 ~est=164 / 8 tokens, $0.000219
+                                                                    ^^^^^^^^^^ shown when |actual-est|/actual > 0.10
+```
+
+The `~est=N` annotation only renders when the disagreement exceeds the
+10% threshold. Silent otherwise.
+
+---
+
+## 7. UX Surface Summary
+
+| Meta | Behavior change |
+|---|---|
+| `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% |
+| (no new metas in v1) | |
+
+| Config | Default | Effect |
+|---|---|---|
+| `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts |
+
+The `cfg.tokenize` block being opt-in is conservative: enabling it
+means every `Context:estimate_tokens()` call may hit the broker. For
+local llama.cpp the cost is ~50ms; for cloud-only configurations there
+IS no /tokenize endpoint so we silently fall through to char/4 (cached
+after one probe). No surprise; document in config example.
+
+---
+
+## 8. Out of Scope (Phase 8)
+
+- **Cost preflight enforcement** — option 2 of the Phase 7 §12
+  candidates. The tokenize work here is a PREREQUISITE for accurate
+  preflight cost estimation, but the enforcement layer itself
+  (cap_at_dollars that REFUSES the call) is its own surface — defer
+  to a separate phase.
+- **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates.
+  Independent of tokenization.
+- **Streaming tokenize** — some servers expose streaming tokenize
+  endpoints for partial-prompt token counts during generation. Out
+  of scope here; we use the blocking /tokenize for batch estimates.
+- **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat,
+  sentencepiece for HuggingFace) — would require vendoring a C library
+  (violates PHASE0 §3) or shelling out to python. Endpoint-based is
+  the only substrate-compliant option for accuracy beyond char/4.
+- **Tokenization for `:cost detail` rows that span multiple turns**
+  — the actual `prompt_tokens` in the accumulator slot is the sum
+  ACROSS calls; the estimate for comparison should be over the
+  CURRENT ctx content. Show the per-call comparison only.
+
+---
+
+## 9. Risks
+
+| Risk | Mitigation |
+|---|---|
+| `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
+| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. |
+| Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
+| Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
+| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
+| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
+
+---
+
+## 10. Open Questions (Phase 8)
+
+| # | Question | Impact | Resolution target |
+|---|---|---|---|
+| Q-T1 | Should the per-turn `_tokens` cache survive across `:reset`? `:reset` clears `turns` anyway, so the cache dies with them. But if turns get appended again, do we re-tokenize from scratch? | Cache lifecycle | Analyze (probably trivially yes — new turns get new cache entries) |
+| Q-T2 | When the active model changes (`:model cloud`), should the `tokenize_fn` re-bind to the new model's tokenizer? The wiring is set once at Context.new time. | Eviction accuracy after `:model` switch | Analyze (the lambda captures `active_cfg` upval — Lua's closure semantics resolve at call time, so YES it follows :model switches naturally) |
+| Q-T3 | Should the probe respect `cfg.tokenize.use_endpoint = false` and skip even the probe? Or always probe and just not USE the result if disabled? | Network behavior with config opt-out | Analyze (skip the probe entirely — opt-out means opt-out) |
+| Q-T4 | What's the actual round-trip latency for `/tokenize` against the live broker for typical aish turn sizes (~500 chars)? | Performance model | Baseline |
+| Q-T5 | Does hossenfelder's `/tokenize` accept the `model` field, or does it use whichever model is currently loaded? | Multi-model accuracy | Baseline |
+| Q-T6 | Should `broker.token_count` also accept a TOOLS-array param so estimates include tool-schema tokens (which the chat_completion sends)? | Eviction accuracy with MCP tools | Analyze |
+
+---
+
+## 11. Phase 8 → Phase 9+ Out-of-band
+
+Candidate follow-ups (non-binding):
+
+- **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) —
+  uses Phase 8's accurate token counts to refuse calls that would
+  cross `cap_at_dollars`. The accuracy work here is the foundation.
+- **Cross-session cost rollup** (Phase 7 §12 option 1) — independent;
+  could land in parallel.
+- **Phase X**: project-local config overlay (`.aish.lua`) — was the
+  alternative scope to Phase 7's cost work. Still valuable but
+  independent of any current line.
+
+Phase 8 itself is self-contained — no upstream dependencies.