aish/docs/PHASE8.md
marfrit aa64ad3eec docs/PHASE8: plan — §13 commit roadmap (5 commits)
Status: Analyze -> Plan. All open Qs resolved (Q-T5 via baseline B1).

5-commit roadmap, bottom-up:

  1. broker.lua — M.token_count helper + per-endpoint capability
     cache. <endpoint>/tokenize probe with 2s timeout; cache true/false
     per (endpoint, model) for the session. char/4 fallback on any
     non-200 / parse-fail / transport err. M.tokenize_supported
     introspection helper.

  2. context.lua — Context.new accepts opts.tokenize_fn; estimate_
     tokens widens to use it when set, with per-turn `_tokens` cache.
     char/4 path unchanged when tokenize_fn nil.

  3. context.lua — enforce_budget consults token_budget too (pillar
     5 from A1). Loop condition: turns>max_turns OR estimate_tokens
     >token_budget. Existing summarize-on-evict callback unchanged.

  4. repl.lua — wire tokenize_fn when cfg.tokenize.use_endpoint=true.
     Closure captures active_cfg upval (A5 — follows :model switches
     naturally). :cost detail extension: trailing line showing
     estimated session ctx tokens for comparison with the per-slot
     prompt_tokens sums in the accumulator.

  5. config.lua commented `tokenize = { use_endpoint = true }`
     example + PHASE8.md status -> Implement.

Per-commit risk index covers: probe latency cap (2s, one-shot),
per-turn cache correctness (immutable post-append), enforce_budget
performance (O(N) per call after cache fill), and the intentional
behavior change of token_budget actually being enforced (sessions
fitting under char/4 may evict earlier under accurate counts —
documented in §9).

Two items open at plan, resolve at implement:
  - exact :cost detail layout for estimated session ctx row
  - whether to add a :tokenize debug meta (defer unless useful in verify)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:24:41 +00:00


aish — Phase 8 Manifest

Project: aish (AI-augmented conversational shell)
Document: Phase 8 Requirements, Architecture & Design Decisions
Status: Plan (formulate + analyze + baseline complete; tree at 79bd40d)
Date: 2026-05-16

Analyze findings (2026-05-16):

A1. enforce_budget ONLY checks max_turns, not token_budget — major scope gap. Context:enforce_budget (context.lua:319) iterates while #self.turns > self.max_turns; self.token_budget = 4096 is set but NEVER consulted. So even with accurate tokenization, eviction decisions are unaffected — the new estimate_tokens() only feeds the prompt template's {ctx_used} display variable (repl.lua:630).

**Resolution**: extend Phase 8 with a NEW pillar 5: make
`enforce_budget` honor `token_budget` AS WELL AS max_turns —
evict the oldest pair when EITHER threshold is exceeded. This
is the real motivation for accurate tokenization; without it
Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
§3 (context.lua row), §9 (new risk row about under-eviction
becoming over-eviction if tokenize_fn returns a much higher
number than char/4).

A2. ffi.curl.M.post signature confirmed. (body, status) on success, (nil, err) on failure. Matches the formulate-time sketch. status is the integer HTTP code. The probe checks status == 200 and out correctly.

A3. Single caller of Context:estimate_tokens() in tree. Only repl.lua:630 (prompt template {ctx_used} substitution) calls it. No internal callers in context.lua. This means:

  • The wiring point is ONE line in repl.lua (the prompt template already runs ctx:estimate_tokens() on every prompt render).
  • With A1's extension, enforce_budget becomes a SECOND caller, and a more frequent one (per turn, not per prompt render).
  • Per-turn _tokens cache becomes important for the enforce_budget path (called from ask_ai after every turn).

A4. Q-T1 RESOLVED: per-turn _tokens cache lives on the turn dict. :reset clears ctx.turns so the cache dies with them. New turns get nil _tokens; lazy-set on first count. Trivial.

A5. Q-T2 RESOLVED: tokenize_fn closure captures active_cfg as an UPVALUE. Upvalues are resolved at closure call time, not at definition time. When the user :model cloud switches, active_cfg = config.models[name] reassigns the local; subsequent tokenize_fn calls see the new value. Natural; no explicit re-binding needed.
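
The upvalue behavior A5 relies on can be checked in isolation. The following is a minimal sketch, not the repl.lua code: the config table, model names, and endpoints are hypothetical, and the closure body is a stub that only reports which model a real broker.token_count(active_cfg, text) call would have used.

```lua
-- Hypothetical stand-in for config.models; names and endpoints are made up.
local config = {
    models = {
        ["local"] = { endpoint = "http://localhost:8080", model = "local-model" },
        cloud     = { endpoint = "https://cloud.example", model = "cloud-model" },
    },
}

local active_cfg = config.models["local"]

-- The closure captures the VARIABLE active_cfg, not its current value.
local function tokenize_fn(text)
    -- Stub for broker.token_count(active_cfg, text): report the model in use.
    return active_cfg.model
end

local before = tokenize_fn("hi")

-- What a :model cloud switch amounts to: reassigning the same local.
active_cfg = config.models.cloud

local after = tokenize_fn("hi")
print(before, after)
```

The property holds only while the switch reassigns the same local that the closure captured; a second `local active_cfg` declared in an inner scope would shadow it and the closure would keep seeing the old table.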

A6. Q-T3 RESOLVED: skip the probe entirely when cfg.tokenize.use_endpoint = false or unset. Don't even call broker.token_count — repl.lua won't wire tokenize_fn to Context.new in the first place. context.lua's tokenize_fn-nil branch handles it (char/4 fallback).

A7. Q-T6 RESOLVED (defer to follow-up): tools-schema tokens are a fixed cost per session (tools_schema doesn't change unless :mcp connect/disconnect lands a new session). The under-count is bounded and predictable. Defer to a future polish; v1 counts only messages. Document in §8 out-of-scope.

A8. Per-turn _tokens cache invalidation. Turn content is immutable after append (we don't mutate stored turns). Cache is safe to live forever on the turn. The only invalidation event is :reset (clears turns wholesale). No other invalidation needed.

A9. Probe latency baseline (Q-T4 deferred): probed manually during formulate. A single tokenize call for ~50 char text ran in ~50ms locally. Assuming roughly the same latency for ~500-char turns, a 40-turn session costs about 40 × 50ms = 2s, and ONLY on the first estimate after a fresh session. After caching, subsequent estimates are O(1) per turn (dict lookups).

A10. Streaming during chat_stream interleaves with tokenize? No — Context:estimate_tokens() is called OUTSIDE the streaming callback (in the main loop, before/after broker calls). No concurrent network competition.

A11. MCP tool turn content. role:"tool" turns have content strings too (the tool result). These get tokenized identically; no special-case needed. Cache key is the turn dict itself, so tool turns get their own _tokens slot.

A12. include_usage interaction with tokenize: orthogonal. The tokenize probe uses a separate (non-streaming) /tokenize endpoint; never sees the chat completion's stream_options.

PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest specifies what Phase 8 adds — accurate tokenization: replace the char/4 heuristic on Context:estimate_tokens() with a per-broker /tokenize round-trip where supported, char/4 fallback otherwise.

Resolves Q1 (PHASE0.md §13, originally targeted at Phase 3 — deferred forward across each phase). PHASE0 §11 amendment to add Phase 8 row lands in the same commit as this formulate doc.


1. Scope of Phase 8

Five pillars (A1 added pillar 5):

  1. Per-endpoint tokenize probe (cached): at first use, send a probe to the broker's tokenize endpoint with a tiny payload. If it returns {tokens: [...]}, mark the endpoint+model as tokenize-capable and use the actual count thereafter. If it 404s or errors, mark the slot as tokenize_supported = false and fall through to char/4 silently. Cached per (endpoint, model) for the session.

  2. broker.token_count(model_cfg, text) — thin wrapper that returns an accurate token count when the (endpoint, model) is tokenize-capable, else the char/4 heuristic. Always returns a non-negative integer; never errors. The probe + fallback is transparent to callers.

  3. Context:estimate_tokens() widening — currently char/4 over system_prompt + sum of turn.contents. The new shape accepts an optional tokenize_fn (callback) at Context.new time and uses it when present; falls back to char/4 when nil. repl.lua wires tokenize_fn = function(text) return broker.token_count(active_cfg, text) end. This means the active model's tokenizer is used for budgeting decisions, which matches the broker the next ask_ai will hit.

  4. :cost detail estimated-vs-actual column — for each (model, category) slot in the accumulator, the actual prompt_tokens from broker usage is already stored. Add an estimated column computed via broker.token_count on the currently-buffered prompt-shape. Disagreement >10% surfaces in a tiny ~est=N annotation so users can see when the heuristic diverges from reality. Display-only; no behavior change.

  5. enforce_budget consults token_budget (A1): currently enforce_budget only iterates while #turns > max_turns. Extend to ALSO check estimate_tokens() > token_budget. Eviction fires when EITHER threshold is exceeded; the existing summarize-on-evict callback (Phase 5) still gets called per evicted pair. This is the real motivation for accurate tokenization; without it, the new token counts are display-only. Default budget (token_budget = 4096) was set at PHASE0 but never enforced; Phase 8 closes that gap.
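
Pillar 5's either-threshold loop can be sketched as follows. This is an illustration under stated assumptions, not the context.lua implementation: new_context and the evictions counter are hypothetical names, "evict the oldest pair" is modeled as removing the two oldest turns, the summarize callback is omitted, and the `#self.turns > 0` guard is an addition of this sketch.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor; only the fields this sketch needs.
local function new_context(opts)
    return setmetatable({
        system_prompt = opts.system_prompt or "",
        turns         = {},
        max_turns     = opts.max_turns or 40,
        token_budget  = opts.token_budget or 4096,
        evictions     = 0,
    }, Context)
end

-- char/4 estimate over system prompt + turn contents.
function Context:estimate_tokens()
    local n = #self.system_prompt
    for _, t in ipairs(self.turns) do n = n + #t.content end
    return math.floor(n / 4)
end

-- Evict the oldest pair while EITHER threshold is exceeded.
-- The #self.turns > 0 guard is this sketch's assumption: eviction cannot
-- shrink the system prompt, so without it an oversized prompt could spin.
function Context:enforce_budget()
    while #self.turns > 0
        and (#self.turns > self.max_turns
             or self:estimate_tokens() > self.token_budget) do
        table.remove(self.turns, 1)
        if #self.turns > 0 then table.remove(self.turns, 1) end
        self.evictions = self.evictions + 1
    end
end

local ctx = new_context({ max_turns = 100, token_budget = 100 })
for i = 1, 6 do
    ctx.turns[#ctx.turns + 1] = {
        role    = (i % 2 == 1) and "user" or "assistant",
        content = string.rep("x", 200),   -- 200 bytes = 50 tokens under char/4
    }
end
ctx:enforce_budget()
print(#ctx.turns, ctx:estimate_tokens(), ctx.evictions)
```

With max_turns = 100 the turn-count clause never fires here, so both evictions come from the token_budget clause alone: 300 tokens, then 200, then 100, which is no longer over budget.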

Phase 8 is done when:

  • A long-running session with the local qwen-coder-7b-snappy-8k model evicts at the RIGHT moment (token_budget=4096 hit triggers eviction via the new pillar 5 path) rather than only when max_turns is exceeded.
  • broker.token_count(local_cfg, "hello world") returns 2 from the live tokenize result. char/4 also yields floor(11/4) = 2 for this string, so the match alone proves nothing; verify the endpoint path via :cost detail against multi-paragraph text.
  • broker.token_count(cloud_cfg, "hello world") returns 2 (char/4 fallback when /tokenize 404s, which it does for OpenRouter).
  • Cached per-endpoint capability — the probe fires once per endpoint per session, not per call.
  • Existing configs without cfg.tokenize behave like Phase 7 (zero behavior change unless opted in via cfg.tokenize.use_endpoint = true).
  • :cost detail shows estimated-vs-actual where disagreement >10%, silent otherwise.
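
The char/4 arithmetic behind the two "returns 2" criteria is small enough to spell out. A minimal sketch; char4 is a hypothetical helper name, and note that Lua's `#` counts bytes, so multi-byte UTF-8 text counts higher than its character length.

```lua
-- char/4 heuristic: byte length divided by four, rounded down.
local function char4(text)
    return math.floor(#(text or "") / 4)
end

print(char4("hello world"))            -- 11 bytes
print(char4(string.rep("a", 500)))     -- a typical ~500-char turn
print(char4(""))                       -- empty text
```

Because "hello world" lands on 2 under both paths, it cannot distinguish the endpoint result from the fallback, which is why the criterion asks for a longer verification text.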

2. Technology Decisions (delta from Phase 7)

| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | <endpoint>/tokenize (NOT <endpoint>/v1/tokenize) | Per real probe against hossenfelder: /v1/tokenize returns 404; /tokenize returns {tokens: [...]}. This is the llama.cpp server convention. |
| Request body shape | {"content": "<text>", "model": "<model>"} | Local model echoed via model; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache tokenize_supported[endpoint][model] = false and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
| Context:estimate_tokens calling convention | Optional tokenize_fn callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires tokenize_fn against active_cfg (the currently active model), so eviction decisions match the broker the next call will hit | When the user :model cloud switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
| :cost detail est-vs-actual | Show only when disagreement >10%; format (prompt: 558 ~est=508 / completion: 80) for the disagreement case, (prompt: 558 / completion: 80) otherwise | Always-on noise; suppress when heuristic is close. |
| New config key | cfg.tokenize = { use_endpoint = true }; default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |

3. Module Changes

| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
| broker.lua | chat, chat_stream, build_request (opts-widened in Phase 7) | New M.token_count(model_cfg, text): tries <endpoint>/tokenize once per (endpoint, model); caches capability; returns int. New M.tokenize_supported(model_cfg) introspection helper for tests. |
| context.lua | estimate_tokens() char/4 sum over system_prompt + turn.contents; enforce_budget() only checks max_turns | Widen estimate_tokens to use self.tokenize_fn(text) if present; else char/4. Per-turn _tokens cache on each turn dict; lazy-set on first count. Extend enforce_budget to ALSO evict when estimate_tokens() > token_budget (A1, pillar 5). |
| repl.lua | wires Context.new with summarize_fn, hosts all metas | tokenize_fn wired into Context.new when cfg.tokenize.use_endpoint = true. :cost detail extended with est-vs-actual column. |
| config.lua | Phase 7 cost block example | Add commented-out tokenize = { use_endpoint = true } block. |
| docs/PHASE0.md | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |

No new module files.


4. Pillar 1+2 — broker.token_count(model_cfg, text)

-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {}    -- [endpoint .. "/" .. model] = true | false

local function _cache_key(model_cfg)
    return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end

function M.token_count(model_cfg, text)
    text = text or ""
    if text == "" then return 0 end
    if not (model_cfg and model_cfg.endpoint) then
        return math.floor(#text / 4)   -- pure fallback
    end
    local key = _cache_key(model_cfg)
    local cap = _tokenize_capable[key]
    if cap == false then
        return math.floor(#text / 4)
    end
    -- cap == nil OR cap == true; try the endpoint.
    local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
    local body = json.encode({ content = text, model = model_cfg.model })
    local out, status = curl.post(url, body,
        { "Content-Type: application/json" },
        2000)  -- 2s timeout
    if not (status == 200 and out) then
        _tokenize_capable[key] = false
        return math.floor(#text / 4)
    end
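    -- pcall: a malformed body must not raise (token_count never errors),
    -- whether json.decode signals failure by error or by returning nil.
    local ok, doc = pcall(json.decode, out)
    local toks = ok and type(doc) == "table" and doc.tokens
    if type(toks) ~= "table" then
        _tokenize_capable[key] = false
        return math.floor(#text / 4)
    end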
    _tokenize_capable[key] = true
    return #toks
end

function M.tokenize_supported(model_cfg)
    if not model_cfg then return nil end
    return _tokenize_capable[_cache_key(model_cfg)]
end

Uses Phase 1's ffi/curl.M.post (blocking POST, returns body + status).
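
The "probe once per unsupported endpoint" behavior of the capability cache can be exercised without a network. The sketch below is an illustration only: fake_post is a hypothetical stand-in that mimics the (body, status) return shape from A2, probe_once is a reduced version of the caching logic above (no JSON parsing, no char/4 return), and both hostnames are made up.

```lua
-- Fake transport standing in for curl.post; counts calls so the
-- caching effect is observable. Hostnames are hypothetical.
local calls = 0
local function fake_post(url, body)
    calls = calls + 1
    if url:find("local.example", 1, true) then
        return "ok", 200
    end
    return "not found", 404
end

local capable = {}   -- [endpoint .. "/" .. model] = true | false

-- Reduced capability logic: a cached false short-circuits with no I/O;
-- nil or true still goes to the endpoint, as in M.token_count.
local function probe_once(cfg)
    local key = cfg.endpoint .. "/" .. cfg.model
    if capable[key] == false then return false end
    local out, status = fake_post(cfg.endpoint .. "/tokenize", "{}")
    capable[key] = (status == 200 and out ~= nil)
    return capable[key]
end

local cloud     = { endpoint = "https://cloud.example",     model = "m" }
local local_cfg = { endpoint = "http://local.example:8080", model = "m" }

probe_once(cloud); probe_once(cloud); probe_once(cloud)
local cloud_calls = calls          -- only the first attempt reached the transport

probe_once(local_cfg); probe_once(local_cfg)
local local_calls = calls - cloud_calls
print(cloud_calls, local_calls)
```

A supported endpoint is still contacted on every call, since that call is what produces the count; only the unsupported case becomes free after the first miss.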


5. Pillar 3 — Context:estimate_tokens widening

function M.new(opts)
    ...
    return setmetatable({
        ...
        -- Phase 8: optional callback that returns an accurate token
        -- count for a given text. Set by repl.lua when cfg.tokenize.
        -- use_endpoint=true, calling broker.token_count(active_cfg, ...).
        -- nil = char/4 fallback (Phase 0 §8 behavior).
        tokenize_fn          = opts.tokenize_fn,
    }, Context)
end

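function Context:estimate_tokens()
    if self.tokenize_fn then
        -- System prompt is re-tokenized on every call (it changes per
        -- turn due to dynamic blocks); turn counts are cached on the
        -- turn dict, which is immutable after append (A8).
        local n = self.tokenize_fn(self.system_prompt)
        for _, t in ipairs(self.turns) do
            if t._tokens == nil then
                t._tokens = self.tokenize_fn(t.content)
            end
            n = n + t._tokens
        end
        return n
    end
    -- char/4 fallback (existing behavior)
    local n = #self.system_prompt
    for _, t in ipairs(self.turns) do n = n + #t.content end
    return math.floor(n / 4)
end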

Performance note: with N turns of average ~500 chars each, an uncached estimate_tokens fires N tokenize round-trips (plus one for the system prompt). For N=40 turns × 50ms that is about 2s, too slow for per-prompt eviction checks. Mitigation: cache the per-turn token count ON the turn dict (turn._tokens) the first time it is counted, and re-tokenize only when turn._tokens is nil. Per A4 the cache is filled lazily on the first estimate; when tokenize_fn is present it can instead be set at append time, which moves the same cost earlier.
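
The effect of the per-turn cache can be seen by counting stub calls. A minimal sketch, assuming a stub tokenize function that returns 42 for every input (as in the commit 2 smoke) and a plain table in place of a real Context; estimate_tokens here is a free function written for the illustration.

```lua
local calls = 0
local function stub_tokenize(text)
    calls = calls + 1     -- each call stands for one /tokenize round-trip
    return 42
end

-- Plain table standing in for a Context with 40 turns.
local ctx = { system_prompt = "sys", turns = {} }
for i = 1, 40 do ctx.turns[i] = { content = "turn " .. i } end

local function estimate_tokens(c, tokenize_fn)
    local n = tokenize_fn(c.system_prompt)        -- never cached
    for _, t in ipairs(c.turns) do
        if t._tokens == nil then
            t._tokens = tokenize_fn(t.content)    -- lazy fill on first count
        end
        n = n + t._tokens
    end
    return n
end

local first = estimate_tokens(ctx, stub_tokenize)
local calls_first = calls
local second = estimate_tokens(ctx, stub_tokenize)
local calls_second = calls - calls_first
print(first, calls_first, second, calls_second)
```

The first estimate pays 41 calls (40 turns plus the system prompt); the second pays one, for the system prompt only, and returns the same total.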


6. Pillar 4 — :cost detail est-vs-actual

Current :cost detail (Phase 7) shows:

  anthropic/claude-haiku-4.5 main                 1 calls,    179 /      8 tokens, $0.000219

The 179 / 8 is prompt_tokens / completion_tokens from the actual broker usage payload.

Phase 8 extension: for each slot, also compute an estimated count via broker.token_count(model_cfg_for_this_slot, ...) over the TURNS THAT CONTRIBUTED to this slot. But that's stateful and expensive — simpler: show the SUM of prompt_tokens (actual) and the SUM of estimate_tokens() (heuristic OR endpoint-based, depending on what tokenize_fn is wired). If disagreement >10%, annotate.

Simplified format:

  anthropic/claude-haiku-4.5 main                 1 calls,    179 ~est=164 / 8 tokens, $0.000219
                                                                    ^^^^^^^^^^ shown when |actual-est|/actual > 0.10

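The ~est=N annotation renders only when the disagreement exceeds the 10% threshold and is silent otherwise. The figures in the sample row illustrate layout only: 179 vs 164 differ by 15/179, about 8%, which is under the threshold.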
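
The threshold rule can be written down as a small formatter. This is a sketch under assumptions: format_prompt_tokens is a hypothetical name, it covers only the prompt-token cell rather than the full :cost detail row, and it treats a zero actual count as "no annotation" to avoid dividing by zero.

```lua
-- Render the prompt-token cell, adding ~est=N only when the relative
-- disagreement |actual - est| / actual exceeds 10%.
local function format_prompt_tokens(actual, est)
    if est and actual > 0
        and math.abs(actual - est) / actual > 0.10 then
        return string.format("%d ~est=%d", actual, est)
    end
    return string.format("%d", actual)
end

print(format_prompt_tokens(179, 150))   -- gap 29/179, over the threshold
print(format_prompt_tokens(179, 164))   -- gap 15/179, under the threshold
print(format_prompt_tokens(179, nil))   -- no estimate available
```

Dividing by the actual count (rather than the estimate) matches the `|actual-est|/actual > 0.10` expression in the caret annotation above.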


7. UX Surface Summary

| Meta | Behavior change |
|---|---|
| :cost detail | Adds ~est=N annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) | |

| Config | Default | Effect |
|---|---|---|
| cfg.tokenize.use_endpoint | false | When true, repl.lua wires tokenize_fn so context budgeting uses real token counts |

The cfg.tokenize block being opt-in is conservative: enabling it means every Context:estimate_tokens() call may hit the broker. For local llama.cpp the cost is ~50ms; for cloud-only configurations there IS no /tokenize endpoint so we silently fall through to char/4 (cached after one probe). No surprise; document in config example.
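
The opt-in gate can be sketched as below. The names build_ctx_opts and the token_count parameter are hypothetical (the real wiring lives in repl.lua and calls broker.token_count against active_cfg); the point is only that an absent or false cfg.tokenize.use_endpoint leaves tokenize_fn nil, so no probe is ever issued.

```lua
-- Example config fragment: the block is commented out, i.e. opted out.
local config = {
    -- tokenize = { use_endpoint = true },
}

-- Hypothetical helper: build the opts table passed to Context.new.
-- token_count stands in for a call into broker.token_count.
local function build_ctx_opts(cfg, token_count)
    local opts = {}
    if cfg.tokenize and cfg.tokenize.use_endpoint then
        opts.tokenize_fn = function(text) return token_count(text) end
    end
    return opts
end

local stub = function(text) return #text end   -- placeholder counter

local off = build_ctx_opts(config, stub)
local on  = build_ctx_opts({ tokenize = { use_endpoint = true } }, stub)
print(off.tokenize_fn == nil, type(on.tokenize_fn))
```

Checking `cfg.tokenize` before indexing it keeps configs written for earlier phases, which have no tokenize block at all, on the unchanged char/4 path.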


8. Out of Scope (Phase 8)

  • Cost preflight enforcement — option 2 of the Phase 7 §12 candidates. The tokenize work here is a PREREQUISITE for accurate preflight cost estimation, but the enforcement layer itself (cap_at_dollars that REFUSES the call) is its own surface — defer to a separate phase.
  • Cross-session cost rollup — option 1 of Phase 7 §12 candidates. Independent of tokenization.
  • Streaming tokenize — some servers expose streaming tokenize endpoints for partial-prompt token counts during generation. Out of scope here; we use the blocking /tokenize for batch estimates.
  • Multi-tokenizer support (e.g. tiktoken for OpenAI compat, sentencepiece for HuggingFace) — would require vendoring a C library (violates PHASE0 §3) or shelling out to python. Endpoint-based is the only substrate-compliant option for accuracy beyond char/4.
  • Tokenization for :cost detail rows that span multiple turns — the actual prompt_tokens in the accumulator slot is the sum ACROSS calls; the estimate for comparison should be over the CURRENT ctx content. Show the per-call comparison only.

9. Risks

| Risk | Mitigation |
|---|---|
| /tokenize 404 silently cached as tokenize_supported = false for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | turn._tokens per-turn cache set at append-time; only re-tokenize on cache miss. |
| Hossenfelder proxy may forward /tokenize differently than direct llama.cpp (e.g., adds /v1/ prefix expected) | B1 confirms /tokenize works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
| Cloud models without /tokenize emit no probes after first 404; fine, but :cost detail est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
| Context:estimate_tokens callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
| Endpoint URL handling: currently endpoint .. "/v1/chat/completions" is hardcoded; tokenize uses endpoint .. "/tokenize" (no /v1), which is asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
| A1 pillar 5: accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default token_budget = 4096 was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on cfg.context.token_budget defaults may see eviction earlier than before; document as intentional. Users can raise token_budget per their model's real context window. |
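
The URL asymmetry from the risk list can be made explicit in two small helpers. A sketch only: strip_trailing_slashes, chat_url, and tokenize_url are hypothetical names, the example endpoint is made up, and stripping trailing slashes on the chat path is an assumption of this sketch (the source only shows it for the tokenize path).

```lua
-- gsub returns (string, count); the parentheses keep only the string.
local function strip_trailing_slashes(endpoint)
    return (endpoint:gsub("/+$", ""))
end

-- Completions go through the OpenAI-compatible /v1 prefix.
local function chat_url(endpoint)
    return strip_trailing_slashes(endpoint) .. "/v1/chat/completions"
end

-- /tokenize is a server-internal llama.cpp endpoint: no /v1 prefix.
local function tokenize_url(endpoint)
    return strip_trailing_slashes(endpoint) .. "/tokenize"
end

print(chat_url("http://host.example:8080/"))
print(tokenize_url("http://host.example:8080/"))
```

Keeping both builders next to each other is one way to document the asymmetry inline, as the mitigation suggests.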

10. Open Questions (Phase 8)

| # | Question | Impact / Resolution target |
|---|---|---|
| Q-T1 | Per-turn _tokens cache across :reset | A4: dies with turns; new turns get nil and lazy-set on first count. Trivial. |
| Q-T2 | tokenize_fn re-bind on :model switch | A5: closure captures active_cfg upvalue; resolved at call time; follows :model switch naturally. No explicit re-binding needed. |
| Q-T3 | Probe respects opt-out | A6: when cfg.tokenize.use_endpoint = false, repl.lua doesn't wire tokenize_fn; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
| Q-T4 | Tokenize round-trip latency | A9: ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
| Q-T5 | /tokenize honors model field | Baseline (not yet probed in detail; assumed yes since llama.cpp echoes model in completions; needs explicit multi-model probe). |
| Q-T6 | tools-schema tokens | A7: deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |

11. Phase 8 → Phase 9+ Out-of-band

Candidate follow-ups (non-binding):

  • Phase 9: cost preflight enforcement (Phase 7 §12 option 2) — uses Phase 8's accurate token counts to refuse calls that would cross cap_at_dollars. The accuracy work here is the foundation.
  • Cross-session cost rollup (Phase 7 §12 option 1) — independent; could land in parallel.
  • Phase X: project-local config overlay (.aish.lua) — was the alternative scope to Phase 7's cost work. Still valuable but independent of any current line.

Phase 8 itself is self-contained — no upstream dependencies.


13. Implementation Plan (commit-by-commit)

Bottom-up: broker first (the egress capability all callers depend on), then context (the consumer + the new pillar 5 budget extension), then repl.lua wiring + display, then config + status bump. Each commit leaves the tree green (existing tests + load smoke + per-commit feature smoke).

Order

  1. broker.lua: M.token_count helper + per-endpoint capability cache.

    • Module-local _tokenize_capable table keyed by endpoint .. "/" .. model.
    • M.token_count(model_cfg, text):
      • empty text -> 0
      • bad cfg (no endpoint) -> char/4 immediately
      • capability cache says false for this slot -> char/4
      • otherwise: probe <endpoint>/tokenize with {content, model} body, 2s timeout. On status == 200 + parseable {tokens=[...]}: cache true, return #tokens. Anything else (non-200, parse fail, transport err): cache false, char/4.
    • M.tokenize_supported(model_cfg) returns the cache slot for introspection (tests + future :tokenize meta).
    • Smoke: hand-call M.token_count(local_cfg, "hello world") -> 2; M.token_count(cloud_cfg, "hello world") -> 2 (char/4 fallback; cache marks cloud as unsupported on first try).
  2. context.lua — estimate_tokens widening + per-turn cache.

    • Context.new accepts opts.tokenize_fn -> stored as self.tokenize_fn.
    • Context:estimate_tokens():
      • if tokenize_fn is nil: existing char/4 (no behavior change).
      • else: tokenize system_prompt (no caching — system prompt changes per turn due to dynamic blocks). For each turn: if turn._tokens is set use it; else compute via tokenize_fn AND cache on turn._tokens.
    • No new helper; the change is internal to estimate_tokens.
    • Smoke: synthetic Context with stub tokenize_fn that returns N=42 for every call; verify estimate sums correctly + cache populates turn._tokens.
  3. context.lua — enforce_budget honors token_budget (pillar 5).

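    • Existing while #self.turns > self.max_turns loop extended to: while #self.turns > self.max_turns or self:estimate_tokens() > self.token_budget do. Guard the token clause with #self.turns > 0: eviction cannot shrink the system prompt, so a prompt that alone exceeds token_budget could otherwise keep the loop spinning.
    • Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
    • The estimate_tokens call inside the loop is potentially expensive under tokenize_fn, but commit #2's per-turn cache means each iteration is O(#turns) dict lookups for the turns, plus one tokenize_fn call for the uncached system prompt. Acceptable for the eviction hot path.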
    • Smoke: Context with token_budget = 100, max_turns = 100, fill with turns until estimate_tokens() > 100, then call enforce_budget — should evict until under budget. With char/4: eviction happens at known char counts. With tokenize_fn stub returning 50/turn: evicts down to <= 100/50 = 2 turns.
  4. repl.lua — tokenize_fn wiring + :cost detail est-vs-actual.

    • When config.tokenize and config.tokenize.use_endpoint, build ctx_opts.tokenize_fn = function(text) return broker.token_count(active_cfg, text) end. The closure captures active_cfg upval per A5 — follows :model switches naturally.
    • :cost detail extension: the accumulator stores only running sums per slot (usage_totals), not per-turn data, so a per-slot estimate cannot be reconstructed from it. The estimate for comparison is therefore taken over ctx's CURRENT content (a single snapshot, not per-call). Simplest: show a SECOND line under :cost detail with the current estimate_tokens() value, labeled "[estimated session ctx: N tokens]". Disagreement with the accumulator's prompt_tokens TOTAL is surfaced inline.
    • Smoke: with use_endpoint=true on a local-only session, observe enforce_budget eviction timing vs disable; observe :cost detail estimate row.
  5. config.lua example block + docs/PHASE8.md status bump.

    • Commented-out tokenize = { use_endpoint = true } block in config.lua with parity to Phase 1-7 example blocks. Document the per-endpoint network cost (one probe per session) and the implication: token_budget actually enforces now.
    • PHASE8.md status header -> Implement.

Risk index per commit

| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
| 2 (context) | per-turn _tokens cache miss on every estimate when no tokenize_fn -> existing perf preserved | Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES; sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
| 4 (repl) | tokenize_fn closure binding to active_cfg upval; if upval somehow gets reassigned wrong, eviction uses wrong tokenizer | Lua upvalues are call-time-resolved; A5 verified. Test by smoke after :model switch. |
| 5 (config + status) | none | |

Tests + smoke per commit

Each commit:

  • Pass luajit test_safety.lua (87/87) and luajit test_router_model.lua (31/31)
  • Load cleanly via luajit -e 'package.path=...; require("repl"); print("ok")'
  • Pass a per-feature smoke (described in each row above)

Things deliberately NOT split

  • New module file for tokenize — small enough to live in broker.lua.
  • Per-text token cache (in addition to per-turn): not needed; turn content is immutable post-append.
  • :tokenize meta for introspecting the cache — M.tokenize_supported is exported for testing; if a user needs runtime visibility, that's a follow-up.

Open at plan-time (resolve at implement)

  • :cost detail layout — how exactly to show "estimated session ctx" relative to the existing per-slot rows. Pick at commit 4 (likely a single trailing line under the detail table).
  • Whether to expose :tokenize <text> for direct-probe debugging. Nice-to-have; defer unless useful during verify.