Files

T

marfrit 00869ba412 docs/PHASE8: formulate — accurate tokenization (resolves Q1)

Phase 8 formulate manifest + PHASE0 §11 amendment to add the Phase 8
row (substrate amendment per CLAUDE.md §3 lands same commit).

Four pillars:

  1. Per-endpoint /tokenize probe (cached). One round-trip on first
     call per (endpoint, model); capability cached for session.
     hossenfelder + llama.cpp expose <endpoint>/tokenize (NOT /v1/
     tokenize — per real probe; the path is endpoint-local, not
     under the OpenAI /v1 prefix). Cloud (OpenRouter) 404s — silent
     char/4 fallback.

  2. broker.token_count(model_cfg, text) — thin wrapper; tries probe,
     falls back to char/4 on miss. Always returns non-negative int;
     never errors. 2s tight timeout; failures cache as not-supported.

  3. Context:estimate_tokens widened. Accepts optional tokenize_fn at
     Context.new; uses it when present, char/4 otherwise. repl.lua
     wires `tokenize_fn = function(text) return broker.token_count(
     active_cfg, text) end` when cfg.tokenize.use_endpoint = true.
     Per-turn _tokens cache to amortize across estimate calls.

  4. :cost detail est-vs-actual annotation. When the heuristic
     disagrees with the actual prompt_tokens from broker usage by
     >10%, show `~est=N`. Silent otherwise. Display-only; no
     behavior change.

Resolves Q1 (PHASE0 §13, originally Phase 3) — replace char/4
heuristic on Context:estimate_tokens. Originally targeted at Phase 3
but deferred forward each iteration; now lands.

Baseline already observed during formulate:
  - /v1/tokenize -> 404 on hossenfelder; /tokenize -> works
  - Body shape: {content: "..."} returns {tokens: [N1, N2, ...]}
  - Accuracy gap: char/4 UNDERESTIMATES by ~10% on real code/prose
    (508 vs 558 on a 2KB README sample). Material for context-
    budget eviction decisions.

Doc covers scope + done-when, tech decisions table, module changes,
per-pillar deep dives, UX surface, out of scope, 6 risk rows, 6
open questions (Q-T4/T5 baseline-bound, others analyze-bound).

Scope confirmed via AskUserQuestion: tokenization (chosen over
cross-session cost persistence and hard rate-limit enforcement).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-16 23:19:53 +00:00

16 KiB

Raw Blame History

aish — Phase 8 Manifest

Project: aish — AI-augmented conversational shell Document: Phase 8 Requirements, Architecture & Design Decisions Status: Formulate (pre-analyze) Date: 2026-05-16

PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest specifies what Phase 8 adds — accurate tokenization: replace the char/4 heuristic on Context:estimate_tokens() with a per-broker /tokenize round-trip where supported, char/4 fallback otherwise.

Resolves Q1 (PHASE0.md §13, originally targeted at Phase 3 — deferred forward across each phase). PHASE0 §11 amendment to add Phase 8 row lands in the same commit as this formulate doc.

1. Scope of Phase 8

Four pillars:

Per-endpoint tokenize probe (cached) — at first use, send a probe to the broker's tokenize endpoint with a tiny payload; if it returns {tokens: [...]} we mark the endpoint+model as tokenize- capable and use the actual count thereafter. If it 404s or errors, mark the slot as tokenize_supported = false and fall through to char/4 silently. Cached per (endpoint, model) for the session.
broker.token_count(model_cfg, text) — thin wrapper that returns an accurate token count when the (endpoint, model) is tokenize-capable, else the char/4 heuristic. Always returns a non-negative integer; never errors. The probe + fallback is transparent to callers.
Context:estimate_tokens() widening — currently char/4 over system_prompt + sum of turn.contents. The new shape accepts an optional tokenize_fn (callback) at Context.new time and uses it when present; falls back to char/4 when nil. repl.lua wires tokenize_fn = function(text) return broker.token_count(active_cfg, text) end. This means the active model's tokenizer is used for budgeting decisions, which matches the broker the next ask_ai will hit.
:cost detail estimated-vs-actual column — for each (model, category) slot in the accumulator, the actual prompt_tokens from broker usage is already stored. Add an estimated column computed via broker.token_count on the currently-buffered prompt-shape. Disagreement >10% surfaces in a tiny ~est=N annotation so users can see when the heuristic diverges from reality. Display-only; no behavior change.

Phase 8 is done when:

A long-running session with the local qwen-coder-7b-snappy-8k model evicts at the RIGHT moment (actual context fills the budget) rather than ~10% late per the baseline gap.
broker.token_count(local_cfg, "hello world") returns 2 (matches the live tokenize result, not the char/4=2 coincidence — verify via :cost detail against multi-paragraph text).
broker.token_count(cloud_cfg, "hello world") returns 2 (char/4 fallback when /tokenize 404s, which it does for OpenRouter).
Cached per-endpoint capability — the probe fires once per endpoint per session, not per call.
Existing configs without cfg.tokenize behave like Phase 7 (zero behavior change unless opted in via cfg.tokenize.use_endpoint = true).
:cost detail shows estimated-vs-actual where disagreement >10%, silent otherwise.

2. Technology Decisions (delta from Phase 7)

Decision	Choice	Rationale
Tokenize endpoint path	`<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`)	Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention.
Request body shape	`{"content": "<text>", "model": "<model>"}`	Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works.
Capability detection	Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session	One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine.
Fallback heuristic	char/4 (Phase 0 §8 convention)	Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available.
`Context:estimate_tokens` calling convention	Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior)	Backward-compatible; no caller break. Opt-in via repl.lua.
Active-model tokenizer	repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit	When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize).
Caching strategy	Endpoint+model capability flag only; NOT per-text token-count cache	Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint.
Per-text timeout cap	2s for tokenize calls (much tighter than the model's normal timeout_ms)	Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4.
`:cost detail` est-vs-actual	Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise	Always-on noise; suppress when heuristic is close.
New config key	`cfg.tokenize = { use_endpoint = true }` — default false until user opts in	Network round-trip cost; user-acknowledged behavior change.

3. Module Changes

File	State after Phase 7	Phase 8 changes
`broker.lua`	`chat`, `chat_stream`, `build_request` (opts-widened in Phase 7)	New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests.
`context.lua`	`estimate_tokens()` char/4 sum over system_prompt + turn.contents	Widen to use `self.tokenize_fn(text)` if present; else char/4. New `tokenize_fn` field on Context (set at Context.new from opts).
`repl.lua`	wires Context.new with summarize_fn, hosts all metas	`tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column.
`config.lua`	Phase 7 cost block example	Add commented-out `tokenize = { use_endpoint = true }` block.
`docs/PHASE0.md`	§11 lists phases 0-7	Amendment: add Phase 8 row to §11.

No new module files.

4. Pillar 1+2 — `broker.token_count(model_cfg, text)`

-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {}    -- [endpoint .. "/" .. model] = true | false

local function _cache_key(model_cfg)
    return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end

function M.token_count(model_cfg, text)
    text = text or ""
    if text == "" then return 0 end
    if not (model_cfg and model_cfg.endpoint) then
        return math.floor(#text / 4)   -- pure fallback
    end
    local key = _cache_key(model_cfg)
    local cap = _tokenize_capable[key]
    if cap == false then
        return math.floor(#text / 4)
    end
    -- cap == nil OR cap == true; try the endpoint.
    local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
    local body = json.encode({ content = text, model = model_cfg.model })
    local out, status = curl.post(url, body,
        { "Content-Type: application/json" },
        2000)  -- 2s timeout
    if not (status == 200 and out) then
        _tokenize_capable[key] = false
        return math.floor(#text / 4)
    end
    local doc = json.decode(out)
    local toks = doc and doc.tokens
    if type(toks) ~= "table" then
        _tokenize_capable[key] = false
        return math.floor(#text / 4)
    end
    _tokenize_capable[key] = true
    return #toks
end

function M.tokenize_supported(model_cfg)
    if not model_cfg then return nil end
    return _tokenize_capable[_cache_key(model_cfg)]
end

Uses Phase 1's ffi/curl.M.post (blocking POST, returns body + status).

5. Pillar 3 — `Context:estimate_tokens` widening

function M.new(opts)
    ...
    return setmetatable({
        ...
        -- Phase 8: optional callback that returns an accurate token
        -- count for a given text. Set by repl.lua when cfg.tokenize.
        -- use_endpoint=true, calling broker.token_count(active_cfg, ...).
        -- nil = char/4 fallback (Phase 0 §8 behavior).
        tokenize_fn          = opts.tokenize_fn,
    }, Context)
end

function Context:estimate_tokens()
    if self.tokenize_fn then
        local n = self.tokenize_fn(self.system_prompt)
        for _, t in ipairs(self.turns) do
            n = n + self.tokenize_fn(t.content)
        end
        return n
    end
    -- char/4 fallback (existing behavior)
    local n = #self.system_prompt
    for _, t in ipairs(self.turns) do n = n + #t.content end
    return math.floor(n / 4)
end

Performance note: with N turns of average ~500 chars each, estimate_tokens fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for per-prompt eviction checks. Mitigation: per-turn token count cached ON the turn dict (turn._tokens) the first time it's counted; only re-tokenized if turn._tokens is nil. Set at append time when tokenize_fn is present; otherwise lazily on first estimate.

6. Pillar 4 — `:cost detail` est-vs-actual

Current :cost detail (Phase 7) shows:

  anthropic/claude-haiku-4.5 main                 1 calls,    179 /      8 tokens, $0.000219

The 179 / 8 is prompt_tokens / completion_tokens from the actual broker usage payload.

Phase 8 extension: for each slot, also compute an estimated count via broker.token_count(model_cfg_for_this_slot, ...) over the TURNS THAT CONTRIBUTED to this slot. But that's stateful and expensive — simpler: show the SUM of prompt_tokens (actual) and the SUM of estimate_tokens() (heuristic OR endpoint-based, depending on what tokenize_fn is wired). If disagreement >10%, annotate.

Simplified format:

  anthropic/claude-haiku-4.5 main                 1 calls,    179 ~est=164 / 8 tokens, $0.000219
                                                                    ^^^^^^^^^^ shown when |actual-est|/actual > 0.10

The ~est=N annotation only renders when the disagreement exceeds the 10% threshold. Silent otherwise.

7. UX Surface Summary

Meta	Behavior change
`:cost detail`	Adds `~est=N` annotation per slot when heuristic disagreement >10%
(no new metas in v1)

Config	Default	Effect
`cfg.tokenize.use_endpoint`	false	When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts

The cfg.tokenize block being opt-in is conservative: enabling it means every Context:estimate_tokens() call may hit the broker. For local llama.cpp the cost is ~50ms; for cloud-only configurations there IS no /tokenize endpoint so we silently fall through to char/4 (cached after one probe). No surprise; document in config example.

8. Out of Scope (Phase 8)

Cost preflight enforcement — option 2 of the Phase 7 §12 candidates. The tokenize work here is a PREREQUISITE for accurate preflight cost estimation, but the enforcement layer itself (cap_at_dollars that REFUSES the call) is its own surface — defer to a separate phase.
Cross-session cost rollup — option 1 of Phase 7 §12 candidates. Independent of tokenization.
Streaming tokenize — some servers expose streaming tokenize endpoints for partial-prompt token counts during generation. Out of scope here; we use the blocking /tokenize for batch estimates.
Multi-tokenizer support (e.g. tiktoken for OpenAI compat, sentencepiece for HuggingFace) — would require vendoring a C library (violates PHASE0 §3) or shelling out to python. Endpoint-based is the only substrate-compliant option for accuracy beyond char/4.
Tokenization for :cost detail rows that span multiple turns — the actual prompt_tokens in the accumulator slot is the sum ACROSS calls; the estimate for comparison should be over the CURRENT ctx content. Show the per-call comparison only.

9. Risks

Risk	Mitigation
`/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config	Per-session cache; restart re-probes. Acceptable.
Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency	`turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss.
Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected)	B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback).
Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then)	Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%.
`Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips	Per-turn cache makes amortized cost O(1) per turn after first count.
Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric	Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not.

10. Open Questions (Phase 8)

#	Question	Impact	Resolution target
Q-T1	Should the per-turn `_tokens` cache survive across `:reset`? `:reset` clears `turns` anyway, so the cache dies with them. But if turns get appended again, do we re-tokenize from scratch?	Cache lifecycle	Analyze (probably trivially yes — new turns get new cache entries)
Q-T2	When the active model changes (`:model cloud`), should the `tokenize_fn` re-bind to the new model's tokenizer? The wiring is set once at Context.new time.	Eviction accuracy after `:model` switch	Analyze (the lambda captures `active_cfg` upval — Lua's closure semantics resolve at call time, so YES it follows :model switches naturally)
Q-T3	Should the probe respect `cfg.tokenize.use_endpoint = false` and skip even the probe? Or always probe and just not USE the result if disabled?	Network behavior with config opt-out	Analyze (skip the probe entirely — opt-out means opt-out)
Q-T4	What's the actual round-trip latency for `/tokenize` against the live broker for typical aish turn sizes (~500 chars)?	Performance model	Baseline
Q-T5	Does hossenfelder's `/tokenize` accept the `model` field, or does it use whichever model is currently loaded?	Multi-model accuracy	Baseline
Q-T6	Should `broker.token_count` also accept a TOOLS-array param so estimates include tool-schema tokens (which the chat_completion sends)?	Eviction accuracy with MCP tools	Analyze

11. Phase 8 → Phase 9+ Out-of-band

Candidate follow-ups (non-binding):

Phase 9: cost preflight enforcement (Phase 7 §12 option 2) — uses Phase 8's accurate token counts to refuse calls that would cross cap_at_dollars. The accuracy work here is the foundation.
Cross-session cost rollup (Phase 7 §12 option 1) — independent; could land in parallel.
Phase X: project-local config overlay (.aish.lua) — was the alternative scope to Phase 7's cost work. Still valuable but independent of any current line.

Phase 8 itself is self-contained — no upstream dependencies.

16 KiB Raw Blame History Unescape Escape