Files
aish/docs/PHASE8.md
T
marfrit aa64ad3eec docs/PHASE8: plan — §13 commit roadmap (5 commits)
Status: Analyze -> Plan. All open Qs resolved (Q-T5 via baseline B1).

5-commit roadmap, bottom-up:

  1. broker.lua — M.token_count helper + per-endpoint capability
     cache. <endpoint>/tokenize probe with 2s timeout; cache true/false
     per (endpoint, model) for the session. char/4 fallback on any
     non-200 / parse-fail / transport err. M.tokenize_supported
     introspection helper.

  2. context.lua — Context.new accepts opts.tokenize_fn; estimate_
     tokens widens to use it when set, with per-turn `_tokens` cache.
     char/4 path unchanged when tokenize_fn nil.

  3. context.lua — enforce_budget consults token_budget too (pillar
     5 from A1). Loop condition: turns>max_turns OR estimate_tokens
     >token_budget. Existing summarize-on-evict callback unchanged.

  4. repl.lua — wire tokenize_fn when cfg.tokenize.use_endpoint=true.
     Closure captures active_cfg upval (A5 — follows :model switches
     naturally). :cost detail extension: trailing line showing
     estimated session ctx tokens for comparison with the per-slot
     prompt_tokens sums in the accumulator.

  5. config.lua commented `tokenize = { use_endpoint = true }`
     example + PHASE8.md status -> Implement.

Per-commit risk index covers: probe latency cap (2s, one-shot),
per-turn cache correctness (immutable post-append), enforce_budget
performance (O(N) per call after cache fill), and the intentional
behavior change of token_budget actually being enforced (sessions
fitting under char/4 may evict earlier under accurate counts —
documented in §9).

Two items open at plan, resolve at implement:
  - exact :cost detail layout for estimated session ctx row
  - whether to add a :tokenize debug meta (defer unless useful in verify)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:24:41 +00:00

516 lines
27 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# aish — Phase 8 Manifest
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 8 Requirements, Architecture & Design Decisions
**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`)
**Date:** 2026-05-16
**Analyze findings (2026-05-16):**
A1. **enforce_budget ONLY checks max_turns, not token_budget — major
scope gap.** `Context:enforce_budget` (context.lua:319) iterates
`while #self.turns > self.max_turns`; `self.token_budget = 4096`
is set but NEVER consulted. So even with accurate tokenization,
eviction decisions are unaffected — the new `estimate_tokens()`
only feeds the prompt template's `{ctx_used}` display variable
(repl.lua:630).
**Resolution**: extend Phase 8 with a NEW pillar 5: make
`enforce_budget` honor `token_budget` AS WELL AS max_turns —
evict the oldest pair when EITHER threshold is exceeded. This
is the real motivation for accurate tokenization; without it
Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
§3 (context.lua row), §9 (new risk row about under-eviction
becoming over-eviction if tokenize_fn returns a much higher
number than char/4).
A2. **`ffi.curl.M.post` signature confirmed.** `(body, status)` on
success, `(nil, err)` on failure. Matches the formulate-time
sketch. status is the integer HTTP code. The probe checks
`status == 200 and out` correctly.
A3. **Single caller of `Context:estimate_tokens()` in tree.** Only
`repl.lua:630` (prompt template `{ctx_used}` substitution) calls
it. No internal callers in context.lua. This means:
- The wiring point is ONE line in repl.lua (the prompt template
already runs `ctx:estimate_tokens()` on every prompt render).
- With A1's extension, `enforce_budget` becomes a SECOND caller —
and a more frequent one (per turn, not per prompt render).
- Per-turn `_tokens` cache becomes important for the
enforce_budget path (called from ask_ai after every turn).
A4. **Q-T1 RESOLVED**: per-turn `_tokens` cache lives on the turn
dict. `:reset` clears `ctx.turns` so the cache dies with them.
New turns get nil `_tokens`; lazy-set on first count. Trivial.
A5. **Q-T2 RESOLVED**: tokenize_fn closure captures `active_cfg` as
an UPVALUE. Upvalues are resolved at closure call time, not at
definition time. When the user `:model cloud` switches,
`active_cfg = config.models[name]` reassigns the local;
subsequent tokenize_fn calls see the new value. Natural; no
explicit re-binding needed.
A6. **Q-T3 RESOLVED**: skip the probe entirely when
`cfg.tokenize.use_endpoint = false` or unset. Don't even call
`broker.token_count` — repl.lua won't wire `tokenize_fn` to
Context.new in the first place. context.lua's tokenize_fn-nil
branch handles it (char/4 fallback).
A7. **Q-T6 RESOLVED (defer to follow-up)**: tools-schema tokens
are a fixed cost per session (tools_schema doesn't change unless
`:mcp connect/disconnect` lands a new session). The under-count
is bounded and predictable. Defer to a future polish; v1
counts only messages. Document in §8 out-of-scope.
A8. **Per-turn `_tokens` cache invalidation.** Turn `content` is
immutable after append (we don't mutate stored turns). Cache is
safe to live forever on the turn. The only invalidation event
is `:reset` (clears turns wholesale). No other invalidation
needed.
A9. **Probe latency baseline** (Q-T4 deferred): probed manually
during formulate — single tokenize call for ~50 char text ran
in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms
= 2s ONLY on the first estimate after a fresh session. After
caching, subsequent estimates are O(1) per turn (dict lookups).
A10. **Streaming during `chat_stream` interleaves with tokenize?**
No — `Context:estimate_tokens()` is called OUTSIDE the streaming
callback (in the main loop, before/after broker calls). No
concurrent network competition.
A11. **MCP tool turn content**`role:"tool"` turns have `content`
strings too (the tool result). These get tokenized identically;
no special-case needed. Cache key is the turn dict itself, so
tool turns get their own `_tokens` slot.
A12. **`include_usage` interaction with tokenize**: orthogonal. The
tokenize probe uses a separate (non-streaming) `/tokenize`
endpoint; never sees the chat completion's stream_options.
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
specifies what Phase 8 adds — **accurate tokenization**: replace the
char/4 heuristic on `Context:estimate_tokens()` with a per-broker
`/tokenize` round-trip where supported, char/4 fallback otherwise.
Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
lands in the same commit as this formulate doc.
---
## 1. Scope of Phase 8
Five pillars (A1 added pillar 5):
1. **Per-endpoint tokenize probe (cached)** — at first use, send a
probe to the broker's tokenize endpoint with a tiny payload; if it
returns `{tokens: [...]}` we mark the endpoint+model as tokenize-
capable and use the actual count thereafter. If it 404s or errors,
mark the slot as `tokenize_supported = false` and fall through to
char/4 silently. Cached per `(endpoint, model)` for the session.
2. **`broker.token_count(model_cfg, text)`** — thin wrapper that
returns an accurate token count when the (endpoint, model) is
tokenize-capable, else the char/4 heuristic. Always returns a
non-negative integer; never errors. The probe + fallback is
transparent to callers.
3. **`Context:estimate_tokens()` widening** — currently char/4 over
`system_prompt` + sum of `turn.content`s. The new shape accepts
an optional `tokenize_fn` (callback) at `Context.new` time and uses
it when present; falls back to char/4 when nil. `repl.lua` wires
`tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`.
This means the active model's tokenizer is used for budgeting
decisions, which matches the broker the next ask_ai will hit.
4. **`:cost detail` estimated-vs-actual column** — for each
(model, category) slot in the accumulator, the actual
`prompt_tokens` from broker usage is already stored. Add an
estimated column computed via `broker.token_count` on the
currently-buffered prompt-shape. Disagreement >10% surfaces in a
tiny `~est=N` annotation so users can see when the heuristic
diverges from reality. Display-only; no behavior change.
5. **`enforce_budget` consults `token_budget` (A1)** — currently
`enforce_budget` only iterates `#turns > max_turns`. Extend to
ALSO check `estimate_tokens() > token_budget`. Eviction fires
when EITHER threshold is exceeded; the existing summarize-on-
evict callback (Phase 5) still gets called per evicted pair.
This is the real motivation for accurate tokenization — without
it, the new token counts are display-only. Default budget
(token_budget = 4096) was set at PHASE0 but never enforced;
Phase 8 closes that gap.
**Phase 8 is done when:**
- A long-running session with the local `qwen-coder-7b-snappy-8k`
model evicts at the RIGHT moment (token_budget=4096 hit triggers
eviction via the new pillar 5 path) rather than only when
max_turns is exceeded.
- `broker.token_count(local_cfg, "hello world")` returns 2 (matches
the live tokenize result, not the char/4=2 coincidence — verify
via `:cost detail` against multi-paragraph text).
- `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4
fallback when /tokenize 404s, which it does for OpenRouter).
- Cached per-endpoint capability — the probe fires once per
endpoint per session, not per call.
- Existing configs without `cfg.tokenize` behave like Phase 7 (zero
behavior change unless opted in via `cfg.tokenize.use_endpoint = true`).
- `:cost detail` shows estimated-vs-actual where disagreement >10%,
silent otherwise.
---
## 2. Technology Decisions (delta from Phase 7)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | `<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. |
| Request body shape | `{"content": "<text>", "model": "<model>"}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
| `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
| `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. |
| New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |
---
## 3. Module Changes
| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
| `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents; `enforce_budget()` only checks max_turns | Widen `estimate_tokens` to use `self.tokenize_fn(text)` if present; else char/4. Per-turn `_tokens` cache on each turn dict; lazy-set on first count. Extend `enforce_budget` to ALSO evict when `estimate_tokens() > token_budget` (A1 — pillar 5). |
| `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
| `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
No new module files.
---
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
```lua
-- Per-endpoint capability cache (session-scoped local in broker.lua)
local _tokenize_capable = {} -- [endpoint .. "/" .. model] = true | false
local function _cache_key(model_cfg)
return (model_cfg.endpoint or "") .. "/" .. (model_cfg.model or "")
end
function M.token_count(model_cfg, text)
text = text or ""
if text == "" then return 0 end
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local key = _cache_key(model_cfg)
local cap = _tokenize_capable[key]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = model_cfg.endpoint:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[key] = false
return math.floor(#text / 4)
end
_tokenize_capable[key] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not model_cfg then return nil end
return _tokenize_capable[_cache_key(model_cfg)]
end
```
Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
---
## 5. Pillar 3 — `Context:estimate_tokens` widening
```lua
function M.new(opts)
...
return setmetatable({
...
-- Phase 8: optional callback that returns an accurate token
-- count for a given text. Set by repl.lua when cfg.tokenize.
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
-- nil = char/4 fallback (Phase 0 §8 behavior).
tokenize_fn = opts.tokenize_fn,
}, Context)
end
function Context:estimate_tokens()
if self.tokenize_fn then
local n = self.tokenize_fn(self.system_prompt)
for _, t in ipairs(self.turns) do
n = n + self.tokenize_fn(t.content)
end
return n
end
-- char/4 fallback (existing behavior)
local n = #self.system_prompt
for _, t in ipairs(self.turns) do n = n + #t.content end
return math.floor(n / 4)
end
```
Performance note: with N turns of average ~500 chars each, `estimate_tokens`
fires N tokenize round-trips. For N=40 turns × 50ms = 2s — too slow for
per-prompt eviction checks. **Mitigation**: per-turn token count cached
ON the turn dict (`turn._tokens`) the first time it's counted; only
re-tokenized if `turn._tokens` is nil. Set at append time when
tokenize_fn is present; otherwise lazily on first estimate.
---
## 6. Pillar 4 — `:cost detail` est-vs-actual
Current `:cost detail` (Phase 7) shows:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
```
The `179 / 8` is `prompt_tokens / completion_tokens` from the actual
broker usage payload.
Phase 8 extension: for each slot, also compute an estimated count via
`broker.token_count(model_cfg_for_this_slot, ...)` over the
TURNS THAT CONTRIBUTED to this slot. But that's stateful and
expensive — simpler: show the SUM of `prompt_tokens` (actual) and
the SUM of `estimate_tokens()` (heuristic OR endpoint-based, depending
on what tokenize_fn is wired). If disagreement >10%, annotate.
Simplified format:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 ~est=164 / 8 tokens, $0.000219
^^^^^^^^^^ shown when |actual-est|/actual > 0.10
```
The `~est=N` annotation only renders when the disagreement exceeds the
10% threshold. Silent otherwise.
---
## 7. UX Surface Summary
| Meta | Behavior change |
|---|---|
| `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) | |
| Config | Default | Effect |
|---|---|---|
| `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts |
The `cfg.tokenize` block being opt-in is conservative: enabling it
means every `Context:estimate_tokens()` call may hit the broker. For
local llama.cpp the cost is ~50ms; for cloud-only configurations there
IS no /tokenize endpoint so we silently fall through to char/4 (cached
after one probe). No surprise; document in config example.
---
## 8. Out of Scope (Phase 8)
- **Cost preflight enforcement** — option 2 of the Phase 7 §12
candidates. The tokenize work here is a PREREQUISITE for accurate
preflight cost estimation, but the enforcement layer itself
(cap_at_dollars that REFUSES the call) is its own surface — defer
to a separate phase.
- **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates.
Independent of tokenization.
- **Streaming tokenize** — some servers expose streaming tokenize
endpoints for partial-prompt token counts during generation. Out
of scope here; we use the blocking /tokenize for batch estimates.
- **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat,
sentencepiece for HuggingFace) — would require vendoring a C library
(violates PHASE0 §3) or shelling out to python. Endpoint-based is
the only substrate-compliant option for accuracy beyond char/4.
- **Tokenization for `:cost detail` rows that span multiple turns**
— the actual `prompt_tokens` in the accumulator slot is the sum
ACROSS calls; the estimate for comparison should be over the
CURRENT ctx content. Show the per-call comparison only.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. |
| Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
| Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
---
## 10. Open Questions (Phase 8)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-T1 | Per-turn `_tokens` cache across `:reset` | A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. |
| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). |
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
---
## 11. Phase 8 → Phase 9+ Out-of-band
Candidate follow-ups (non-binding):
- **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) —
uses Phase 8's accurate token counts to refuse calls that would
cross `cap_at_dollars`. The accuracy work here is the foundation.
- **Cross-session cost rollup** (Phase 7 §12 option 1) — independent;
could land in parallel.
- **Phase X**: project-local config overlay (`.aish.lua`) — was the
alternative scope to Phase 7's cost work. Still valuable but
independent of any current line.
Phase 8 itself is self-contained — no upstream dependencies.
---
## 13. Implementation Plan (commit-by-commit)
Bottom-up: broker first (the egress capability all callers depend
on), then context (the consumer + the new pillar 5 budget extension),
then repl.lua wiring + display, then config + status bump. Each
commit leaves the tree green (existing tests + load smoke + per-
commit feature smoke).
### Order
1. **`broker.lua``M.token_count` helper + per-endpoint capability cache.**
- Module-local `_tokenize_capable` table keyed by `endpoint .. "/" .. model`.
- `M.token_count(model_cfg, text)`:
- empty text -> 0
- bad cfg (no endpoint) -> char/4 immediately
- capability cache says `false` for this slot -> char/4
- otherwise: probe `<endpoint>/tokenize` with `{content, model}` body,
2s timeout. On `status == 200 + parseable {tokens=[...]}`:
cache `true`, return `#tokens`. Anything else (non-200, parse
fail, transport err): cache `false`, char/4.
- `M.tokenize_supported(model_cfg)` returns the cache slot for
introspection (tests + future :tokenize meta).
- Smoke: hand-call `M.token_count(local_cfg, "hello world")` -> 2;
`M.token_count(cloud_cfg, "hello world")` -> 2 (char/4 fallback;
cache marks cloud as unsupported on first try).
2. **`context.lua` — estimate_tokens widening + per-turn cache.**
- Context.new accepts `opts.tokenize_fn` -> stored as `self.tokenize_fn`.
- `Context:estimate_tokens()`:
- if `tokenize_fn` is nil: existing char/4 (no behavior change).
- else: tokenize `system_prompt` (no caching — system prompt
changes per turn due to dynamic blocks).
For each turn: if `turn._tokens` is set use it; else
compute via tokenize_fn AND cache on turn._tokens.
- No new helper; the change is internal to estimate_tokens.
- Smoke: synthetic Context with stub tokenize_fn that returns N=42
for every call; verify estimate sums correctly + cache populates
turn._tokens.
3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
- Existing `while #self.turns > self.max_turns` loop extended to:
`while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`.
- Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
- The estimate_tokens call inside the loop is potentially expensive
under tokenize_fn — but commit #2's per-turn cache means each
iteration is O(#turns) dict-lookups. Acceptable for the eviction
hot path.
- Smoke: Context with `token_budget = 100`, max_turns = 100, fill
with turns until `estimate_tokens() > 100`, then call enforce_budget
— should evict until under budget. With char/4: eviction happens
at known char counts. With tokenize_fn stub returning 50/turn:
evicts down to <= 100/50 = 2 turns.
4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.**
- When `config.tokenize and config.tokenize.use_endpoint`, build
`ctx_opts.tokenize_fn = function(text)
return broker.token_count(active_cfg, text)
end`. The closure captures `active_cfg` upval per A5 — follows
`:model` switches naturally.
- `:cost detail` extension: per slot, compute the sum of actual
`prompt_tokens` (already in usage_totals) AND an "estimated"
count via... actually this needs a rethink: the accumulator
doesn't store per-turn data, just running sums. The estimate
for comparison would need to be over ctx's CURRENT content
(a single snapshot, not per-call). Simplest: show a SECOND
line under `:cost detail` with the current `estimate_tokens()`
value, labeled "[estimated session ctx: N tokens]". Disagreement
with the accumulator's prompt_tokens TOTAL surfaced inline.
- Smoke: with use_endpoint=true on a local-only session, observe
enforce_budget eviction timing vs disable; observe :cost detail
estimate row.
5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
- Commented-out `tokenize = { use_endpoint = true }` block in
config.lua with parity to Phase 1-7 example blocks. Document
the per-endpoint network cost (one probe per session) and the
implication: token_budget actually enforces now.
- PHASE8.md status header -> **Implement**.
### Risk index per commit
| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
| 2 (context) | per-turn `_tokens` cache miss on every estimate when no tokenize_fn -> existing perf preserved | Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES — sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
| 4 (repl) | tokenize_fn closure binding to `active_cfg` upval — if upval somehow gets reassigned wrong, eviction uses wrong tokenizer | Lua upvalues are call-time-resolved; A5 verified. Test by smoke after `:model` switch. |
| 5 (config + status) | none | |
### Tests + smoke per commit
Each commit:
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
- Pass a per-feature smoke (described in each row above)
### Things deliberately NOT split
- New module file for tokenize — small enough to live in broker.lua.
- Per-text token cache (in addition to per-turn): not needed; turn
content is immutable post-append.
- :tokenize meta for introspecting the cache — `M.tokenize_supported`
is exported for testing; if a user needs runtime visibility, that's
a follow-up.
### Open at plan-time (resolve at implement)
- :cost detail layout — how exactly to show "estimated session ctx"
relative to the existing per-slot rows. Pick at commit 4 (likely
a single trailing line under the detail table).
- Whether to expose `:tokenize <text>` for direct-probe debugging.
Nice-to-have; defer unless useful during verify.