Files
aish/docs/PHASE8.md
T
marfrit 00869ba412 docs/PHASE8: formulate — accurate tokenization (resolves Q1)
Phase 8 formulate manifest + PHASE0 §11 amendment to add the Phase 8
row (substrate amendment per CLAUDE.md §3 lands same commit).

Four pillars:

  1. Per-endpoint /tokenize probe (cached). One round-trip on first
     call per (endpoint, model); capability cached for session.
     hossenfelder + llama.cpp expose <endpoint>/tokenize (NOT /v1/
     tokenize — per real probe; the path is endpoint-local, not
     under the OpenAI /v1 prefix). Cloud (OpenRouter) 404s — silent
     char/4 fallback.

  2. broker.token_count(model_cfg, text) — thin wrapper; tries probe,
     falls back to char/4 on miss. Always returns non-negative int;
     never errors. 2s tight timeout; failures cache as not-supported.

  3. Context:estimate_tokens widened. Accepts optional tokenize_fn at
     Context.new; uses it when present, char/4 otherwise. repl.lua
     wires `tokenize_fn = function(text) return broker.token_count(
     active_cfg, text) end` when cfg.tokenize.use_endpoint = true.
     Per-turn _tokens cache to amortize across estimate calls.

  4. :cost detail est-vs-actual annotation. When the heuristic
     disagrees with the actual prompt_tokens from broker usage by
     >10%, show `~est=N`. Silent otherwise. Display-only; no
     behavior change.

Resolves Q1 (PHASE0 §13, originally Phase 3) — replace char/4
heuristic on Context:estimate_tokens. Originally targeted at Phase 3
but deferred forward each iteration; now lands.

Baseline already observed during formulate:
  - /v1/tokenize -> 404 on hossenfelder; /tokenize -> works
  - Body shape: {content: "..."} returns {tokens: [N1, N2, ...]}
  - Accuracy gap: char/4 UNDERESTIMATES by ~10% on real code/prose
    (508 vs 558 on a 2KB README sample). Material for context-
    budget eviction decisions.

Doc covers scope + done-when, tech decisions table, module changes,
per-pillar deep dives, UX surface, out of scope, 6 risk rows, 6
open questions (Q-T4/T5 baseline-bound, others analyze-bound).

Scope confirmed via AskUserQuestion: tokenization (chosen over
cross-session cost persistence and hard rate-limit enforcement).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:19:53 +00:00

304 lines
16 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# aish — Phase 8 Manifest
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 8 Requirements, Architecture & Design Decisions
**Status:** Formulate (pre-analyze)
**Date:** 2026-05-16
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
specifies what Phase 8 adds — **accurate tokenization**: replace the
char/4 heuristic on `Context:estimate_tokens()` with a per-broker
`/tokenize` round-trip where supported, char/4 fallback otherwise.
Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
lands in the same commit as this formulate doc.
---
## 1. Scope of Phase 8
Four pillars:
1. **Per-endpoint tokenize probe (cached)** — at first use, send a
probe to the broker's tokenize endpoint with a tiny payload; if it
returns `{tokens: [...]}` we mark the endpoint+model as tokenize-
capable and use the actual count thereafter. If it 404s or errors,
mark the slot as `tokenize_supported = false` and fall through to
char/4 silently. Cached per `(endpoint, model)` for the session.
2. **`broker.token_count(model_cfg, text)`** — thin wrapper that
returns an accurate token count when the (endpoint, model) is
tokenize-capable, else the char/4 heuristic. Always returns a
non-negative integer; never errors. The probe + fallback is
transparent to callers.
3. **`Context:estimate_tokens()` widening** — currently char/4 over
`system_prompt` + sum of `turn.content`s. The new shape accepts
an optional `tokenize_fn` (callback) at `Context.new` time and uses
it when present; falls back to char/4 when nil. `repl.lua` wires
`tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`.
This means the active model's tokenizer is used for budgeting
decisions, which matches the broker the next ask_ai will hit.
4. **`:cost detail` estimated-vs-actual column** — for each
(model, category) slot in the accumulator, the actual
`prompt_tokens` from broker usage is already stored. Add an
estimated column computed via `broker.token_count` on the
currently-buffered prompt-shape. Disagreement >10% surfaces in a
tiny `~est=N` annotation so users can see when the heuristic
diverges from reality. Display-only; no behavior change.
**Phase 8 is done when:**
- A long-running session with the local `qwen-coder-7b-snappy-8k`
model evicts at the RIGHT moment (actual context fills the
budget) rather than ~10% late per the baseline gap.
- `broker.token_count(local_cfg, "hello world")` returns 2 (matches
the live tokenize result, not the char/4=2 coincidence — verify
via `:cost detail` against multi-paragraph text).
- `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4
fallback when /tokenize 404s, which it does for OpenRouter).
- Cached per-endpoint capability — the probe fires once per
endpoint per session, not per call.
- Existing configs without `cfg.tokenize` behave like Phase 7 (zero
behavior change unless opted in via `cfg.tokenize.use_endpoint = true`).
- `:cost detail` shows estimated-vs-actual where disagreement >10%,
silent otherwise.
---
## 2. Technology Decisions (delta from Phase 7)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | `<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. |
| Request body shape | `{"content": "<text>", "model": "<model>"}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
| `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
| `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. |
| New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |
---
## 3. Module Changes
| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
| `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents | Widen to use `self.tokenize_fn(text)` if present; else char/4. New `tokenize_fn` field on Context (set at Context.new from opts). |
| `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
| `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
No new module files.
---
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
```lua
-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false
local function _cache_key(model_cfg)
return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end
function M.token_count(model_cfg, text)
text = text or ""
if text == "" then return 0 end
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local key = _cache_key(model_cfg)
local cap = _tokenize_capable[key]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
_tokenize_capable[key] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not model_cfg then return nil end
return _tokenize_capable[_cache_key(model_cfg)]
end
```
Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
---
## 5. Pillar 3 — `Context:estimate_tokens` widening
```lua
function M.new(opts)
...
return setmetatable({
...
-- Phase 8: optional callback that returns an accurate token
-- count for a given text. Set by repl.lua when cfg.tokenize.
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
-- nil = char/4 fallback (Phase 0 §8 behavior).
tokenize_fn = opts.tokenize_fn,
}, Context)
end
function Context:estimate_tokens()
if self.tokenize_fn then
local n = self.tokenize_fn(self.system_prompt)
for _, t in ipairs(self.turns) do
n = n + self.tokenize_fn(t.content)
end
return n
end
-- char/4 fallback (existing behavior)
local n = #self.system_prompt
for _, t in ipairs(self.turns) do n = n + #t.content end
return math.floor(n / 4)
end
```
Performance note: with N turns of average ~500 chars each, `estimate_tokens`
fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
per-prompt eviction checks. **Mitigation**: per-turn token count cached
ON the turn dict (`turn._tokens`) the first time it's counted; only
re-tokenized if `turn._tokens` is nil. Set at append time when
tokenize_fn is present; otherwise lazily on first estimate.
---
## 6. Pillar 4 — `:cost detail` est-vs-actual
Current `:cost detail` (Phase 7) shows:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
```
The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
broker usage payload.
Phase 8 extension: for each slot, also compute an estimated count via
`broker.token_count(model_cfg_for_this_slot, ...)` over the
TURNS THAT CONTRIBUTED to this slot. But that's stateful and
expensive — simpler: show the SUM of `prompt_tokens` (actual) and
the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
on what tokenize_fn is wired). If disagreement >10%, annotate.
Simplified format:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219
^^^^^^^^^^ shown when |actual-est|/actual > 0.10
```
The `~est=N` annotation only renders when the disagreement exceeds the
10% threshold. Silent otherwise.
---
## 7. UX Surface Summary
| Meta | Behavior change |
|---|---|
| `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) | |
| Config | Default | Effect |
|---|---|---|
| `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts |
The `cfg.tokenize` block being opt-in is conservative: enabling it
means every `Context:estimate_tokens()` call may hit the broker. For
local llama.cpp the cost is ~50ms; for cloud-only configurations there
IS no /tokenize endpoint so we silently fall through to char/4 (cached
after one probe). No surprise; document in config example.
---
## 8. Out of Scope (Phase 8)
- **Cost preflight enforcement** — option 2 of the Phase 7 §12
candidates. The tokenize work here is a PREREQUISITE for accurate
preflight cost estimation, but the enforcement layer itself
(cap_at_dollars that REFUSES the call) is its own surface — defer
to a separate phase.
- **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates.
Independent of tokenization.
- **Streaming tokenize** — some servers expose streaming tokenize
endpoints for partial-prompt token counts during generation. Out
of scope here; we use the blocking /tokenize for batch estimates.
- **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat,
sentencepiece for HuggingFace) — would require vendoring a C library
(violates PHASE0 §3) or shelling out to python. Endpoint-based is
the only substrate-compliant option for accuracy beyond char/4.
- **Tokenization for `:cost detail` rows that span multiple turns**
— the actual `prompt_tokens` in the accumulator slot is the sum
ACROSS calls; the estimate for comparison should be over the
CURRENT ctx content. Show the per-call comparison only.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. |
| Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
| Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
---
## 10. Open Questions (Phase 8)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-T1 | Should the per-turn `_tokens` cache survive across `:reset`? `:reset` clears `turns` anyway, so the cache dies with them. But if turns get appended again, do we re-tokenize from scratch? | Cache lifecycle | Analyze (probably trivially yes — new turns get new cache entries) |
| Q-T2 | When the active model changes (`:model cloud`), should the `tokenize_fn` re-bind to the new model's tokenizer? The wiring is set once at Context.new time. | Eviction accuracy after `:model` switch | Analyze (the lambda captures `active_cfg` upval — Lua's closure semantics resolve at call time, so YES it follows :model switches naturally) |
| Q-T3 | Should the probe respect `cfg.tokenize.use_endpoint = false` and skip even the probe? Or always probe and just not USE the result if disabled? | Network behavior with config opt-out | Analyze (skip the probe entirely — opt-out means opt-out) |
| Q-T4 | What's the actual round-trip latency for `/tokenize` against the live broker for typical aish turn sizes (~500 chars)? | Performance model | Baseline |
| Q-T5 | Does hossenfelder's `/tokenize` accept the `model` field, or does it use whichever model is currently loaded? | Multi-model accuracy | Baseline |
| Q-T6 | Should `broker.token_count` also accept a TOOLS-array param so estimates include tool-schema tokens (which the chat_completion sends)? | Eviction accuracy with MCP tools | Analyze |
---
## 11. Phase 8 → Phase 9+ Out-of-band
Candidate follow-ups (non-binding):
- **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) —
uses Phase 8's accurate token counts to refuse calls that would
cross `cap_at_dollars`. The accuracy work here is the foundation.
- **Cross-session cost rollup** (Phase 7 §12 option 1) — independent;
could land in parallel.
- **Phase X**: project-local config overlay (`.aish.lua`) — was the
alternative scope to Phase 7's cost work. Still valuable but
independent of any current line.
Phase 8 itself is self-contained — no upstream dependencies.