config.lua:
- Commented-out `tokenize = { use_endpoint = true }` block with
parity to the Phase 1-7 example blocks.
- Documents the two consequences: (1) per-turn network cost
(~30ms first time, cached after) and (2) token_budget is now
actually enforced — sessions that fit under char/4 may evict
earlier under accurate counts.
- Notes cloud /tokenize 404 fallback path.
docs/PHASE8.md:
- Status header bumped: "Plan + review fold-in" -> "Implement"
- Lists the 5 implement commits inline for traceability:
7ef2a6e broker: token_count + endpoint cache
8502517 context: tokenize_fn + _tokens cache
db26d0c context: enforce_budget honors token_budget (R2 guard)
94b7d86 repl: wire tokenize_fn + :cost detail estimate row
this config example + status bump
Phase 8 implementation is complete. Resolves Q1 (PHASE0 §13,
originally Phase 3, deferred forward). Next inner-loop step: verify
(7) — file test cases, run autonomous, close. Then memory-update (8).
Regression: test_safety 87/87, test_router_model 31/31, repl loads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
33 KiB
aish — Phase 8 Manifest
Project: aish — AI-augmented conversational shell
Document: Phase 8 Requirements, Architecture & Design Decisions
Status: Implement (5 commits landed: 7ef2a6e, 8502517, db26d0c, 94b7d86, this)
Date: 2026-05-16
Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs resolved in-place, 4 CONCERNs folded, 4 NITs applied:
R1 (BLOCKER, RESOLVED). §5 pseudocode missing per-turn cache pattern.
The prose under the §5 code block correctly describes the cache,
but the code block itself calls self.tokenize_fn(t.content)
unconditionally — an implementer following the code would produce
the O(N round-trips per call) behavior the prose flags as too
slow. Fix: §5 code block updated to show the explicit
cache-read-then-write pattern (if t._tokens then ... else t._tokens = self.tokenize_fn(t.content) end). §13 commit 2 row
also calls this out explicitly.
R2 (BLOCKER, RESOLVED). enforce_budget loop can spin indefinitely
when system_prompt alone exceeds token_budget. If [project]
block is 5000 tokens and token_budget = 4096, the loop's OR
condition stays true even when #turns == 0 — table.remove
is a no-op, the loop never exits. Fix: §13 commit 3 row
updated to specify the explicit guard: while (#self.turns > self.max_turns or self:estimate_tokens() > self.token_budget) and #self.turns > 0 do. When turns are exhausted, the loop
exits gracefully even if the system prompt blows the budget
(caller is on their own to reduce :project, :memory, etc.).
R3 (CONCERN, FOLDED). :cost detail comparison semantically
undefined. Sum-of-prompt_tokens across all calls (accumulator)
vs current-snapshot estimate are incommensurable — sessions
with evictions ALWAYS show divergence not because heuristic is
wrong but because they measure different things. Resolution:
§6 reworked to drop the per-slot ~est=N inline annotation
(which conflated the two); instead show a SINGLE trailing
"[estimated session ctx: N tokens]" line under :cost detail.
Cleanly separates the running-total accumulator from the
current-snapshot estimate. §13 commit 4 already pointed to this
direction — now §6 matches.
R4 (CONCERN, FOLDED). tokenize_fn closure must reference active_cfg
by upvalue, not by value-capture. If the implementer writes
local cfg = active_cfg; return function(text) ... cfg ... end,
the closure won't follow :model switches. Fix: §13 commit
4 row gains an explicit code note: the closure MUST be
function(text) return broker.token_count(active_cfg, text) end
— direct upvalue reference. A5 analysis verified upvalue
semantics; now spelled out so the implementer doesn't subtly
miss it.
R5 (CONCERN, FOLDED). 2s tokenize timeout can spuriously cache as
unsupported when llama.cpp is busy serving a concurrent
completion. llama.cpp is single-threaded for inference; a
/tokenize request that arrives mid-generation queues behind
inference and may exceed the 2s cap. The capability would then
cache as false for the rest of the session, even though the
endpoint IS capable. Fix: §9 risk row added documenting
this. Mitigation: 2s is reasonable for IDLE responses but if
practical problems surface, bump to 5s or make configurable
(cfg.tokenize.timeout_ms). v1 ships 2s; revisit in verify if
it bites.
R6 (CONCERN, FOLDED). Per-endpoint cache key conflates two
same-endpoint/different-model presets. B1 confirmed
/tokenize ignores the model field, so two probes per session
when one would suffice. Fix: §4 cache key SIMPLIFIED to
just model_cfg.endpoint (B1-justified). Same-endpoint
presets share one cache entry; one probe per endpoint per
session, not per (endpoint, model). For a future broker that
DOES honor the model field, this design choice would need
revisiting — documented inline.
R-N1..N4 (NITs, APPLIED):
N1. §13 commit 3 condition uses uppercase OR/AND — corrected
to Lua's lowercase or/and.
N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
"Baseline (B1)" for consistency.
N3. §6 outdated inline ~est=N description removed; new approach
(single trailing summary line) is documented; §8 out-of-scope
bullet about per-call comparison stays as the explicit "we
considered, rejected" record.
N4. PHASE8.md status header (formerly carrying a stale tree hash
that would drift before implementation) now references the
latest tree as of this fold-in (aa64ad3). Commit 5's status
bump to "Implement" will refresh it again at that point.
Analyze findings (2026-05-16):
A1. enforce_budget ONLY checks max_turns, not token_budget — major
scope gap. Context:enforce_budget (context.lua:319) iterates
while #self.turns > self.max_turns; self.token_budget = 4096
is set but NEVER consulted. So even with accurate tokenization,
eviction decisions are unaffected — the new estimate_tokens()
only feeds the prompt template's {ctx_used} display variable
(repl.lua:630).
**Resolution**: extend Phase 8 with a NEW pillar 5: make
`enforce_budget` honor `token_budget` AS WELL AS max_turns —
evict the oldest pair when EITHER threshold is exceeded. This
is the real motivation for accurate tokenization; without it
Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
§3 (context.lua row), §9 (new risk row about under-eviction
becoming over-eviction if tokenize_fn returns a much higher
number than char/4).
A2. ffi.curl.M.post signature confirmed. (body, status) on
success, (nil, err) on failure. Matches the formulate-time
sketch. status is the integer HTTP code. The probe checks
status == 200 and out correctly.
A3. Single caller of Context:estimate_tokens() in tree. Only
repl.lua:630 (prompt template {ctx_used} substitution) calls
it. No internal callers in context.lua. This means:
- The wiring point is ONE line in repl.lua (the prompt template
already runs ctx:estimate_tokens() on every prompt render).
- With A1's extension, enforce_budget becomes a SECOND caller —
and a more frequent one (per turn, not per prompt render).
- Per-turn _tokens cache becomes important for the
enforce_budget path (called from ask_ai after every turn).
A4. Q-T1 RESOLVED: per-turn _tokens cache lives on the turn
dict. :reset clears ctx.turns so the cache dies with them.
New turns get nil _tokens; lazy-set on first count. Trivial.
A5. Q-T2 RESOLVED: tokenize_fn closure captures active_cfg as
an UPVALUE. Upvalues are resolved at closure call time, not at
definition time. When the user :model cloud switches,
active_cfg = config.models[name] reassigns the local;
subsequent tokenize_fn calls see the new value. Natural; no
explicit re-binding needed.
A6. Q-T3 RESOLVED: skip the probe entirely when
cfg.tokenize.use_endpoint = false or unset. Don't even call
broker.token_count — repl.lua won't wire tokenize_fn to
Context.new in the first place. context.lua's tokenize_fn-nil
branch handles it (char/4 fallback).
A7. Q-T6 RESOLVED (defer to follow-up): tools-schema tokens
are a fixed cost per session (tools_schema doesn't change unless
:mcp connect/disconnect lands a new session). The under-count
is bounded and predictable. Defer to a future polish; v1
counts only messages. Document in §8 out-of-scope.
A8. Per-turn _tokens cache invalidation. Turn content is
immutable after append (we don't mutate stored turns). Cache is
safe to live forever on the turn. The only invalidation event
is :reset (clears turns wholesale). No other invalidation
needed.
A9. Probe latency baseline (Q-T4 deferred): probed manually during formulate — single tokenize call for ~50 char text ran in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms = 2s ONLY on the first estimate after a fresh session. After caching, subsequent estimates are O(1) per turn (dict lookups).
A10. Streaming during chat_stream interleaves with tokenize?
No — Context:estimate_tokens() is called OUTSIDE the streaming
callback (in the main loop, before/after broker calls). No
concurrent network competition.
A11. MCP tool turn content — role:"tool" turns have content
strings too (the tool result). These get tokenized identically;
no special-case needed. Cache key is the turn dict itself, so
tool turns get their own _tokens slot.
A12. include_usage interaction with tokenize: orthogonal. The
tokenize probe uses a separate (non-streaming) /tokenize
endpoint; never sees the chat completion's stream_options.
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
specifies what Phase 8 adds — accurate tokenization: replace the
char/4 heuristic on Context:estimate_tokens() with a per-broker
/tokenize round-trip where supported, char/4 fallback otherwise.
Resolves Q1 (PHASE0.md §13, originally targeted at Phase 3 — deferred
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
lands in the same commit as this formulate doc.
1. Scope of Phase 8
Five pillars (A1 added pillar 5):
-
Per-endpoint tokenize probe (cached) — at first use, send a probe to the broker's tokenize endpoint with a tiny payload; if it returns
{tokens: [...]}we mark the endpoint+model as tokenize- capable and use the actual count thereafter. If it 404s or errors, mark the slot astokenize_supported = falseand fall through to char/4 silently. Cached per(endpoint, model)for the session. -
broker.token_count(model_cfg, text)— thin wrapper that returns an accurate token count when the (endpoint, model) is tokenize-capable, else the char/4 heuristic. Always returns a non-negative integer; never errors. The probe + fallback is transparent to callers. -
Context:estimate_tokens()widening — currently char/4 oversystem_prompt+ sum ofturn.contents. The new shape accepts an optionaltokenize_fn(callback) atContext.newtime and uses it when present; falls back to char/4 when nil.repl.luawirestokenize_fn = function(text) return broker.token_count(active_cfg, text) end. This means the active model's tokenizer is used for budgeting decisions, which matches the broker the next ask_ai will hit. -
:cost detailestimated-vs-actual column — for each (model, category) slot in the accumulator, the actualprompt_tokensfrom broker usage is already stored. Add an estimated column computed viabroker.token_counton the currently-buffered prompt-shape. Disagreement >10% surfaces in a tiny~est=Nannotation so users can see when the heuristic diverges from reality. Display-only; no behavior change. -
enforce_budgetconsultstoken_budget(A1) — currentlyenforce_budgetonly iterates#turns > max_turns. Extend to ALSO checkestimate_tokens() > token_budget. Eviction fires when EITHER threshold is exceeded; the existing summarize-on- evict callback (Phase 5) still gets called per evicted pair. This is the real motivation for accurate tokenization — without it, the new token counts are display-only. Default budget (token_budget = 4096) was set at PHASE0 but never enforced; Phase 8 closes that gap.
Phase 8 is done when:
- A long-running session with the local
qwen-coder-7b-snappy-8kmodel evicts at the RIGHT moment (token_budget=4096 hit triggers eviction via the new pillar 5 path) rather than only when max_turns is exceeded. broker.token_count(local_cfg, "hello world")returns 2 (matches the live tokenize result, not the char/4=2 coincidence — verify via:cost detailagainst multi-paragraph text).broker.token_count(cloud_cfg, "hello world")returns 2 (char/4 fallback when /tokenize 404s, which it does for OpenRouter).- Cached per-endpoint capability — the probe fires once per endpoint per session, not per call.
- Existing configs without
cfg.tokenizebehave like Phase 7 (zero behavior change unless opted in viacfg.tokenize.use_endpoint = true). :cost detailshows estimated-vs-actual where disagreement >10%, silent otherwise.
2. Technology Decisions (delta from Phase 7)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | <endpoint>/tokenize (NOT <endpoint>/v1/tokenize) |
Per real probe against hossenfelder: /v1/tokenize returns 404; /tokenize returns {tokens: [...]}. This is the llama.cpp server convention. |
| Request body shape | {"content": "<text>", "model": "<model>"} |
Local model echoed via model; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache tokenize_supported[endpoint][model] = false and never retry that session |
One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
Context:estimate_tokens calling convention |
Optional tokenize_fn callback at Context.new; absent = char/4 (existing behavior) |
Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires tokenize_fn against active_cfg (the currently active model), so eviction decisions match the broker the next call will hit |
When the user :model cloud switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
:cost detail est-vs-actual |
Show only when disagreement >10%; format (prompt: 558 ~est=508 / completion: 80) for the disagreement case, (prompt: 558 / completion: 80) otherwise |
Always-on noise; suppress when heuristic is close. |
| New config key | cfg.tokenize = { use_endpoint = true } — default false until user opts in |
Network round-trip cost; user-acknowledged behavior change. |
3. Module Changes
| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
broker.lua |
chat, chat_stream, build_request (opts-widened in Phase 7) |
New M.token_count(model_cfg, text): tries <endpoint>/tokenize once per (endpoint, model); caches capability; returns int. New M.tokenize_supported(model_cfg) introspection helper for tests. |
context.lua |
estimate_tokens() char/4 sum over system_prompt + turn.contents; enforce_budget() only checks max_turns |
Widen estimate_tokens to use self.tokenize_fn(text) if present; else char/4. Per-turn _tokens cache on each turn dict; lazy-set on first count. Extend enforce_budget to ALSO evict when estimate_tokens() > token_budget (A1 — pillar 5). |
repl.lua |
wires Context.new with summarize_fn, hosts all metas | tokenize_fn wired into Context.new when cfg.tokenize.use_endpoint = true. :cost detail extended with est-vs-actual column. |
config.lua |
Phase 7 cost block example | Add commented-out tokenize = { use_endpoint = true } block. |
docs/PHASE0.md |
§11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
No new module files.
4. Pillar 1+2 — broker.token_count(model_cfg, text)
R6-revised — cache key is endpoint-only (B1: /tokenize ignores the model field so two presets sharing an endpoint share one cache entry):
-- Per-endpoint capability cache (session-scoped local in broker.lua).
-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
-- model field; same endpoint -> same tokenization). If a future
-- broker honors the model field, revisit this keying.
local _tokenize_capable = {} -- [endpoint] = true | false
function M.token_count(model_cfg, text)
text = text or ""
if text == "" then return 0 end
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local ep = model_cfg.endpoint
local cap = _tokenize_capable[ep]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = ep:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
_tokenize_capable[ep] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not (model_cfg and model_cfg.endpoint) then return nil end
return _tokenize_capable[model_cfg.endpoint]
end
Uses Phase 1's ffi/curl.M.post (blocking POST, returns body + status).
5. Pillar 3 — Context:estimate_tokens widening
R1-revised — cache pattern is IN the reference code, not just prose:
function M.new(opts)
...
return setmetatable({
...
-- Phase 8: optional callback that returns an accurate token
-- count for a given text. Set by repl.lua when cfg.tokenize.
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
-- nil = char/4 fallback (Phase 0 §8 behavior).
tokenize_fn = opts.tokenize_fn,
}, Context)
end
function Context:estimate_tokens()
if self.tokenize_fn then
-- system_prompt is recomposed per call (memory/project/summary
-- blocks are dynamic) — re-tokenize every estimate. Bounded
-- by one round-trip.
local n = self.tokenize_fn(self.system_prompt)
-- R1: per-turn cache on the turn dict itself. Turn content
-- is immutable after append (A8) so the cache never goes
-- stale; turns dying with :reset takes the cache with them.
for _, t in ipairs(self.turns) do
if t._tokens == nil then
t._tokens = self.tokenize_fn(t.content)
end
n = n + t._tokens
end
return n
end
-- char/4 fallback (existing behavior)
local n = #self.system_prompt
for _, t in ipairs(self.turns) do n = n + #t.content end
return math.floor(n / 4)
end
Performance: first call after a fresh session fires N+1 round-trips (N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
- N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms amortized per call — acceptable for the prompt-template render path AND the per-step Norris enforce_budget call.
6. Pillar 4 — :cost detail est-vs-actual
Current :cost detail (Phase 7) shows:
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
The 179 / 8 is prompt_tokens / completion_tokens SUMMED across all
calls in that slot — including any turns later evicted from context.
R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
would conflate two different things — the per-slot prompt_tokens is a
cumulative running total (across calls AND past evicted turns), while
estimate_tokens() is a current-snapshot measurement (in-memory turns
ONLY). Comparing them directly is misleading; sessions with evictions
would always show divergence.
Instead, add a SINGLE trailing summary line after the slot rows:
... per-slot rows ...
[estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
The estimate is ctx:estimate_tokens() over the current ctx (system
prompt + live turns); the percentage gives at-a-glance budget
utilization. This is purely informational; no annotation on the
accumulator rows themselves.
7. UX Surface Summary
| Meta | Behavior change |
|---|---|
:cost detail |
Adds ~est=N annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) |
| Config | Default | Effect |
|---|---|---|
cfg.tokenize.use_endpoint |
false | When true, repl.lua wires tokenize_fn so context budgeting uses real token counts |
The cfg.tokenize block being opt-in is conservative: enabling it
means every Context:estimate_tokens() call may hit the broker. For
local llama.cpp the cost is ~50ms; for cloud-only configurations there
IS no /tokenize endpoint so we silently fall through to char/4 (cached
after one probe). No surprise; document in config example.
8. Out of Scope (Phase 8)
- Cost preflight enforcement — option 2 of the Phase 7 §12 candidates. The tokenize work here is a PREREQUISITE for accurate preflight cost estimation, but the enforcement layer itself (cap_at_dollars that REFUSES the call) is its own surface — defer to a separate phase.
- Cross-session cost rollup — option 1 of Phase 7 §12 candidates. Independent of tokenization.
- Streaming tokenize — some servers expose streaming tokenize endpoints for partial-prompt token counts during generation. Out of scope here; we use the blocking /tokenize for batch estimates.
- Multi-tokenizer support (e.g. tiktoken for OpenAI compat, sentencepiece for HuggingFace) — would require vendoring a C library (violates PHASE0 §3) or shelling out to python. Endpoint-based is the only substrate-compliant option for accuracy beyond char/4.
- Tokenization for
:cost detailrows that span multiple turns — the actualprompt_tokensin the accumulator slot is the sum ACROSS calls; the estimate for comparison should be over the CURRENT ctx content. Show the per-call comparison only.
9. Risks
| Risk | Mitigation |
|---|---|
/tokenize 404 silently cached as tokenize_supported = false for a typo'd endpoint config |
Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | turn._tokens per-turn cache set at append-time; only re-tokenize on cache miss. |
Hossenfelder proxy may forward /tokenize differently than direct llama.cpp (e.g., adds /v1/ prefix expected) |
B1 confirms /tokenize works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
Cloud models without /tokenize emit no probes after first 404 — fine but :cost detail est-vs-actual will always agree (both are char/4 then) |
Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
Context:estimate_tokens callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips |
Per-turn cache makes amortized cost O(1) per turn after first count. |
Endpoint URL handling — currently endpoint .. "/v1/chat/completions" is hardcoded; tokenize uses endpoint .. "/tokenize" (no /v1) — asymmetric |
Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default token_budget = 4096 was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on cfg.context.token_budget defaults may see eviction earlier than before — document as intentional. Users can raise token_budget per their model's real context window. |
| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (cfg.tokenize.timeout_ms) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
10. Open Questions (Phase 8)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-T1 | Per-turn _tokens cache across :reset |
A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. | |
| Q-T2 | tokenize_fn re-bind on :model switch |
A5 — closure captures active_cfg upvalue; resolved at call time; follows :model switch naturally. No explicit re-binding needed. |
|
| Q-T3 | Probe respects opt-out | A6 — when cfg.tokenize.use_endpoint = false, repl.lua doesn't wire tokenize_fn; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
|
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. | |
| Q-T5 | /tokenize honors model field |
B1 RESOLVED — /tokenize IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
|
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
11. Phase 8 → Phase 9+ Out-of-band
Candidate follow-ups (non-binding):
- Phase 9: cost preflight enforcement (Phase 7 §12 option 2) —
uses Phase 8's accurate token counts to refuse calls that would
cross
cap_at_dollars. The accuracy work here is the foundation. - Cross-session cost rollup (Phase 7 §12 option 1) — independent; could land in parallel.
- Phase X: project-local config overlay (
.aish.lua) — was the alternative scope to Phase 7's cost work. Still valuable but independent of any current line.
Phase 8 itself is self-contained — no upstream dependencies.
13. Implementation Plan (commit-by-commit)
Bottom-up: broker first (the egress capability all callers depend on), then context (the consumer + the new pillar 5 budget extension), then repl.lua wiring + display, then config + status bump. Each commit leaves the tree green (existing tests + load smoke + per- commit feature smoke).
Order
-
broker.lua—M.token_counthelper + per-endpoint capability cache.- Module-local
_tokenize_capabletable keyed byendpoint .. "/" .. model. M.token_count(model_cfg, text):- empty text -> 0
- bad cfg (no endpoint) -> char/4 immediately
- capability cache says
falsefor this slot -> char/4 - otherwise: probe
<endpoint>/tokenizewith{content, model}body, 2s timeout. Onstatus == 200 + parseable {tokens=[...]}: cachetrue, return#tokens. Anything else (non-200, parse fail, transport err): cachefalse, char/4.
M.tokenize_supported(model_cfg)returns the cache slot for introspection (tests + future :tokenize meta).- Smoke: hand-call
M.token_count(local_cfg, "hello world")-> 2;M.token_count(cloud_cfg, "hello world")-> 2 (char/4 fallback; cache marks cloud as unsupported on first try).
- Module-local
-
context.lua— estimate_tokens widening + per-turn cache.- Context.new accepts
opts.tokenize_fn-> stored asself.tokenize_fn. Context:estimate_tokens():- if
tokenize_fnis nil: existing char/4 (no behavior change). - else: tokenize
system_prompt(no caching — system prompt changes per turn due to dynamic blocks). For each turn: ifturn._tokensis set use it; else compute via tokenize_fn AND cache on turn._tokens.
- if
- No new helper; the change is internal to estimate_tokens.
- Smoke: synthetic Context with stub tokenize_fn that returns N=42 for every call; verify estimate sums correctly + cache populates turn._tokens.
- Context.new accepts
-
context.lua— enforce_budget honors token_budget (pillar 5).- Existing
while #self.turns > self.max_turnsloop extended. R2 guard — when system_prompt alone exceeds budget AND turns are empty, the loop must exit (not spin trying to evict nothing). Correct condition:Lowercasewhile (#self.turns > self.max_turns or self:estimate_tokens() > self.token_budget) and #self.turns > 0 door/andper Lua syntax (N1). - Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
- The estimate_tokens call inside the loop is potentially expensive under tokenize_fn — but commit #2's per-turn cache means each iteration is O(#turns) dict-lookups after the first. Acceptable for the eviction hot path.
- Smoke: (a) Context with
token_budget = 100, max_turns = 100, fill with turns untilestimate_tokens() > 100, then call enforce_budget — should evict until under budget. (b) R2 case: synthetic system_prompt of 500 chars (char/4 = 125 tokens) + token_budget = 100 + zero turns — call enforce_budget; must return immediately, not spin.
- Existing
-
repl.lua— tokenize_fn wiring + :cost detail estimate row.- When
config.tokenize and config.tokenize.use_endpoint, buildctx_opts.tokenize_fn = function(text) return broker.token_count(active_cfg, text) end. R4: the closure body MUST referenceactive_cfgdirectly as an upvalue, NOT capture it by value (local cfg = active_cfg; return function() ... cfg ... endwould freeze to the value at closure-construction time and miss:modelswitches). A5 verified upvalue semantics in Lua. :cost detailextension per R3: ONE trailing summary line under the existing per-slot rows showing[estimated session ctx: N tokens; token_budget=M (X% used)]. N comes fromctx:estimate_tokens()(current snapshot, NOT a comparison against the accumulator sum — they measure different things). M isctx.token_budget. X% = N/M × 100.- Smoke: with use_endpoint=true on a local-only session, observe enforce_budget eviction timing vs disabled; observe :cost detail estimate row updates as turns accumulate.
- When
-
config.luaexample block +docs/PHASE8.mdstatus bump.- Commented-out
tokenize = { use_endpoint = true }block in config.lua with parity to Phase 1-7 example blocks. Document the per-endpoint network cost (one probe per session) and the implication: token_budget actually enforces now. - PHASE8.md status header -> Implement.
- Commented-out
Risk index per commit
| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
| 2 (context) | per-turn _tokens cache miss on every estimate when no tokenize_fn -> existing perf preserved |
Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES — sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
| 4 (repl) | tokenize_fn closure binding to active_cfg upval — if upval somehow gets reassigned wrong, eviction uses wrong tokenizer |
Lua upvalues are call-time-resolved; A5 verified. Test by smoke after :model switch. |
| 5 (config + status) | none |
Tests + smoke per commit
Each commit:
- Pass
luajit test_safety.lua(87/87) andluajit test_router_model.lua(31/31) - Load cleanly via
luajit -e 'package.path=...; require("repl"); print("ok")' - Pass a per-feature smoke (described in each row above)
Things deliberately NOT split
- New module file for tokenize — small enough to live in broker.lua.
- Per-text token cache (in addition to per-turn): not needed; turn content is immutable post-append.
- :tokenize meta for introspecting the cache —
M.tokenize_supportedis exported for testing; if a user needs runtime visibility, that's a follow-up.
Open at plan-time (resolve at implement)
- :cost detail layout — how exactly to show "estimated session ctx" relative to the existing per-slot rows. Pick at commit 4 (likely a single trailing line under the detail table).
- Whether to expose
:tokenize <text>for direct-probe debugging. Nice-to-have; defer unless useful during verify.