Phase 8 formulate manifest + PHASE0 §11 amendment to add the Phase 8
row (substrate amendment per CLAUDE.md §3 lands same commit).
Four pillars:
1. Per-endpoint /tokenize probe (cached). One round-trip on first
call per (endpoint, model); capability cached for session.
hossenfelder + llama.cpp expose <endpoint>/tokenize (NOT /v1/
tokenize — per real probe; the path is endpoint-local, not
under the OpenAI /v1 prefix). Cloud (OpenRouter) 404s — silent
char/4 fallback.
2. broker.token_count(model_cfg, text) — thin wrapper; tries probe,
falls back to char/4 on miss. Always returns non-negative int;
never errors. 2s tight timeout; failures cache as not-supported.
3. Context:estimate_tokens widened. Accepts optional tokenize_fn at
Context.new; uses it when present, char/4 otherwise. repl.lua
wires `tokenize_fn = function(text) return broker.token_count(
active_cfg, text) end` when cfg.tokenize.use_endpoint = true.
Per-turn _tokens cache to amortize across estimate calls.
4. :cost detail est-vs-actual annotation. When the heuristic
disagrees with the actual prompt_tokens from broker usage by
>10%, show `~est=N`. Silent otherwise. Display-only; no
behavior change.
Resolves Q1 (PHASE0 §13, originally Phase 3) — replace char/4
heuristic on Context:estimate_tokens. Originally targeted at Phase 3
but deferred forward each iteration; now lands.
Baseline already observed during formulate:
- /v1/tokenize -> 404 on hossenfelder; /tokenize -> works
- Body shape: {content: "..."} returns {tokens: [N1, N2, ...]}
- Accuracy gap: char/4 UNDERESTIMATES by ~10% on real code/prose
(508 vs 558 on a 2KB README sample). Material for context-
budget eviction decisions.
Doc covers scope + done-when, tech decisions table, module changes,
per-pillar deep dives, UX surface, out of scope, 6 risk rows, 6
open questions (Q-T4/T5 baseline-bound, others analyze-bound).
Scope confirmed via AskUserQuestion: tokenization (chosen over
cross-session cost persistence and hard rate-limit enforcement).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
16 KiB
aish — Phase 8 Manifest
Project: aish — AI-augmented conversational shell Document: Phase 8 Requirements, Architecture & Design Decisions Status: Formulate (pre-analyze) Date: 2026-05-16
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
specifies what Phase 8 adds — accurate tokenization: replace the
char/4 heuristic on Context:estimate_tokens() with a per-broker
/tokenize round-trip where supported, char/4 fallback otherwise.
Resolves Q1 (PHASE0.md §13, originally targeted at Phase 3 — deferred
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
lands in the same commit as this formulate doc.
1. Scope of Phase 8
Four pillars:
-
Per-endpoint tokenize probe (cached) — at first use, send a probe to the broker's tokenize endpoint with a tiny payload; if it returns
{tokens: [...]}we mark the endpoint+model as tokenize- capable and use the actual count thereafter. If it 404s or errors, mark the slot astokenize_supported = falseand fall through to char/4 silently. Cached per(endpoint, model)for the session. -
broker.token_count(model_cfg, text)— thin wrapper that returns an accurate token count when the (endpoint, model) is tokenize-capable, else the char/4 heuristic. Always returns a non-negative integer; never errors. The probe + fallback is transparent to callers. -
Context:estimate_tokens()widening — currently char/4 oversystem_prompt+ sum ofturn.contents. The new shape accepts an optionaltokenize_fn(callback) atContext.newtime and uses it when present; falls back to char/4 when nil.repl.luawirestokenize_fn = function(text) return broker.token_count(active_cfg, text) end. This means the active model's tokenizer is used for budgeting decisions, which matches the broker the next ask_ai will hit. -
:cost detailestimated-vs-actual column — for each (model, category) slot in the accumulator, the actualprompt_tokensfrom broker usage is already stored. Add an estimated column computed viabroker.token_counton the currently-buffered prompt-shape. Disagreement >10% surfaces in a tiny~est=Nannotation so users can see when the heuristic diverges from reality. Display-only; no behavior change.
Phase 8 is done when:
- A long-running session with the local
qwen-coder-7b-snappy-8kmodel evicts at the RIGHT moment (actual context fills the budget) rather than ~10% late per the baseline gap. broker.token_count(local_cfg, "hello world")returns 2 (matches the live tokenize result, not the char/4=2 coincidence — verify via:cost detailagainst multi-paragraph text).broker.token_count(cloud_cfg, "hello world")returns 2 (char/4 fallback when /tokenize 404s, which it does for OpenRouter).- Cached per-endpoint capability — the probe fires once per endpoint per session, not per call.
- Existing configs without
cfg.tokenizebehave like Phase 7 (zero behavior change unless opted in viacfg.tokenize.use_endpoint = true). :cost detailshows estimated-vs-actual where disagreement >10%, silent otherwise.
2. Technology Decisions (delta from Phase 7)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | <endpoint>/tokenize (NOT <endpoint>/v1/tokenize) |
Per real probe against hossenfelder: /v1/tokenize returns 404; /tokenize returns {tokens: [...]}. This is the llama.cpp server convention. |
| Request body shape | {"content": "<text>", "model": "<model>"} |
Local model echoed via model; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache tokenize_supported[endpoint][model] = false and never retry that session |
One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
Context:estimate_tokens calling convention |
Optional tokenize_fn callback at Context.new; absent = char/4 (existing behavior) |
Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires tokenize_fn against active_cfg (the currently active model), so eviction decisions match the broker the next call will hit |
When the user :model cloud switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
:cost detail est-vs-actual |
Show only when disagreement >10%; format (prompt: 558 ~est=508 / completion: 80) for the disagreement case, (prompt: 558 / completion: 80) otherwise |
Always-on noise; suppress when heuristic is close. |
| New config key | cfg.tokenize = { use_endpoint = true } — default false until user opts in |
Network round-trip cost; user-acknowledged behavior change. |
3. Module Changes
| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
broker.lua |
chat, chat_stream, build_request (opts-widened in Phase 7) |
New M.token_count(model_cfg, text): tries <endpoint>/tokenize once per (endpoint, model); caches capability; returns int. New M.tokenize_supported(model_cfg) introspection helper for tests. |
context.lua |
estimate_tokens() char/4 sum over system_prompt + turn.contents |
Widen to use self.tokenize_fn(text) if present; else char/4. New tokenize_fn field on Context (set at Context.new from opts). |
repl.lua |
wires Context.new with summarize_fn, hosts all metas | tokenize_fn wired into Context.new when cfg.tokenize.use_endpoint = true. :cost detail extended with est-vs-actual column. |
config.lua |
Phase 7 cost block example | Add commented-out tokenize = { use_endpoint = true } block. |
docs/PHASE0.md |
§11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
No new module files.
4. Pillar 1+2 — broker.token_count(model_cfg, text)
-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false
local function _cache_key(model_cfg)
return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end
function M.token_count(model_cfg, text)
text = text or ""
if text == "" then return 0 end
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local key = _cache_key(model_cfg)
local cap = _tokenize_capable[key]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
_tokenize_capable[key] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not model_cfg then return nil end
return _tokenize_capable[_cache_key(model_cfg)]
end
Uses Phase 1's ffi/curl.M.post (blocking POST, returns body + status).
5. Pillar 3 — Context:estimate_tokens widening
function M.new(opts)
...
return setmetatable({
...
-- Phase 8: optional callback that returns an accurate token
-- count for a given text. Set by repl.lua when cfg.tokenize.
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
-- nil = char/4 fallback (Phase 0 §8 behavior).
tokenize_fn = opts.tokenize_fn,
}, Context)
end
function Context:estimate_tokens()
if self.tokenize_fn then
local n = self.tokenize_fn(self.system_prompt)
for _, t in ipairs(self.turns) do
n = n + self.tokenize_fn(t.content)
end
return n
end
-- char/4 fallback (existing behavior)
local n = #self.system_prompt
for _, t in ipairs(self.turns) do n = n + #t.content end
return math.floor(n / 4)
end
Performance note: with N turns of average ~500 chars each, estimate_tokens
fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
per-prompt eviction checks. Mitigation: per-turn token count cached
ON the turn dict (turn._tokens) the first time it's counted; only
re-tokenized if turn._tokens is nil. Set at append time when
tokenize_fn is present; otherwise lazily on first estimate.
6. Pillar 4 — :cost detail est-vs-actual
Current :cost detail (Phase 7) shows:
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
The 179 / 8 is prompt_tokens / completion_tokens from the actual
broker usage payload.
Phase 8 extension: for each slot, also compute an estimated count via
broker.token_count(model_cfg_for_this_slot, ...) over the
TURNS THAT CONTRIBUTED to this slot. But that's stateful and
expensive — simpler: show the SUM of prompt_tokens (actual) and
the SUM of estimate_tokens() (heuristic OR endpoint-based, depending
on what tokenize_fn is wired). If disagreement >10%, annotate.
Simplified format:
anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219
^^^^^^^^^^ shown when |actual-est|/actual > 0.10
The ~est=N annotation only renders when the disagreement exceeds the
10% threshold. Silent otherwise.
7. UX Surface Summary
| Meta | Behavior change |
|---|---|
:cost detail |
Adds ~est=N annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) |
| Config | Default | Effect |
|---|---|---|
cfg.tokenize.use_endpoint |
false | When true, repl.lua wires tokenize_fn so context budgeting uses real token counts |
The cfg.tokenize block being opt-in is conservative: enabling it
means every Context:estimate_tokens() call may hit the broker. For
local llama.cpp the cost is ~50ms; for cloud-only configurations there
IS no /tokenize endpoint so we silently fall through to char/4 (cached
after one probe). No surprise; document in config example.
8. Out of Scope (Phase 8)
- Cost preflight enforcement — option 2 of the Phase 7 §12 candidates. The tokenize work here is a PREREQUISITE for accurate preflight cost estimation, but the enforcement layer itself (cap_at_dollars that REFUSES the call) is its own surface — defer to a separate phase.
- Cross-session cost rollup — option 1 of Phase 7 §12 candidates. Independent of tokenization.
- Streaming tokenize — some servers expose streaming tokenize endpoints for partial-prompt token counts during generation. Out of scope here; we use the blocking /tokenize for batch estimates.
- Multi-tokenizer support (e.g. tiktoken for OpenAI compat, sentencepiece for HuggingFace) — would require vendoring a C library (violates PHASE0 §3) or shelling out to python. Endpoint-based is the only substrate-compliant option for accuracy beyond char/4.
- Tokenization for
:cost detailrows that span multiple turns — the actualprompt_tokensin the accumulator slot is the sum ACROSS calls; the estimate for comparison should be over the CURRENT ctx content. Show the per-call comparison only.
9. Risks
| Risk | Mitigation |
|---|---|
/tokenize 404 silently cached as tokenize_supported = false for a typo'd endpoint config |
Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | turn._tokens per-turn cache set at append-time; only re-tokenize on cache miss. |
Hossenfelder proxy may forward /tokenize differently than direct llama.cpp (e.g., adds /v1/ prefix expected) |
B1 confirms /tokenize works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
Cloud models without /tokenize emit no probes after first 404 — fine but :cost detail est-vs-actual will always agree (both are char/4 then) |
Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
Context:estimate_tokens callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips |
Per-turn cache makes amortized cost O(1) per turn after first count. |
Endpoint URL handling — currently endpoint .. "/v1/chat/completions" is hardcoded; tokenize uses endpoint .. "/tokenize" (no /v1) — asymmetric |
Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
10. Open Questions (Phase 8)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-T1 | Should the per-turn _tokens cache survive across :reset? :reset clears turns anyway, so the cache dies with them. But if turns get appended again, do we re-tokenize from scratch? |
Cache lifecycle | Analyze (probably trivially yes — new turns get new cache entries) |
| Q-T2 | When the active model changes (:model cloud), should the tokenize_fn re-bind to the new model's tokenizer? The wiring is set once at Context.new time. |
Eviction accuracy after :model switch |
Analyze (the lambda captures active_cfg upval — Lua's closure semantics resolve at call time, so YES it follows :model switches naturally) |
| Q-T3 | Should the probe respect cfg.tokenize.use_endpoint = false and skip even the probe? Or always probe and just not USE the result if disabled? |
Network behavior with config opt-out | Analyze (skip the probe entirely — opt-out means opt-out) |
| Q-T4 | What's the actual round-trip latency for /tokenize against the live broker for typical aish turn sizes (~500 chars)? |
Performance model | Baseline |
| Q-T5 | Does hossenfelder's /tokenize accept the model field, or does it use whichever model is currently loaded? |
Multi-model accuracy | Baseline |
| Q-T6 | Should broker.token_count also accept a TOOLS-array param so estimates include tool-schema tokens (which the chat_completion sends)? |
Eviction accuracy with MCP tools | Analyze |
11. Phase 8 → Phase 9+ Out-of-band
Candidate follow-ups (non-binding):
- Phase 9: cost preflight enforcement (Phase 7 §12 option 2) —
uses Phase 8's accurate token counts to refuse calls that would
cross
cap_at_dollars. The accuracy work here is the foundation. - Cross-session cost rollup (Phase 7 §12 option 1) — independent; could land in parallel.
- Phase X: project-local config overlay (
.aish.lua) — was the alternative scope to Phase 7's cost work. Still valuable but independent of any current line.
Phase 8 itself is self-contained — no upstream dependencies.