Files
aish/docs/PHASE8.md
T
marfrit 467e573d24 docs/PHASE8: review fold-in — 2 BLOCKERs + 4 CONCERNs + 4 NITs
Sonnet-reviewed per reviews-use-sonnet memory directive.

BLOCKERs (RESOLVED in-place):

R1. §5 estimate_tokens pseudocode missing per-turn cache pattern.
    Prose described it; code block called tokenize_fn unconditionally.
    Implementer following code verbatim would hit the O(N round-
    trips per call) perf gap the prose flagged. Code block now
    shows explicit `if t._tokens then ... else t._tokens = ... end`.

R2. enforce_budget loop can spin forever when system_prompt alone
    exceeds token_budget (e.g. 5KB project block + budget=4096 +
    zero turns -> turns can't shrink further but OR-condition stays
    true). Fix: AND `#self.turns > 0` guard on the loop. §13 commit
    3 row shows the explicit Lua-syntax condition.

CONCERNs (FOLDED):

R3. :cost detail per-slot ~est=N annotation was semantically
    undefined — accumulator sum (cumulative across calls + evicted
    turns) vs current-snapshot estimate are incommensurable. §6
    reworked: ONE trailing summary line "[estimated session ctx:
    N tokens; token_budget=M (X% used)]" instead of per-slot
    annotations. §13 commit 4 aligned.

R4. tokenize_fn closure MUST reference active_cfg as upvalue (NOT
    capture by value). Subtle but easy to miss — §13 commit 4 now
    spells out the correct vs wrong patterns explicitly.

R5. 2s tokenize timeout can spuriously cache-as-unsupported when
    llama.cpp is busy with a concurrent completion (single-threaded
    inference; /tokenize queues behind). Documented in §9; v1
    ships 2s, revisit during verify if it bites.

R6. Per-endpoint cache key conflated two same-endpoint/different-
    model presets (B1: /tokenize ignores the model field). Cache
    key simplified to endpoint-only. One probe per endpoint per
    session; if a future broker honors the model field, revisit.

NITs (APPLIED):

N1. §13 commit 3 `OR`/`AND` -> Lua-syntax `or`/`and`.
N2. §10 Q-T5 Resolution-target cell filled in (was blank after B1).
N3. §6 / §8 / §13 commit 4 now describe a CONSISTENT approach
    (trailing summary line; per-slot annotation dropped).
N4. Status header tree-hash updated to current (aa64ad3 -> stays
    fresh through review fold-in; commit 5 will refresh again
    at "Implement" status).

PHASE8.md now 622 lines (was 454 after plan). +168/-61. Ready for
implementation phase 6 of the inner loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:28:27 +00:00

623 lines
33 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# aish — Phase 8 Manifest
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 8 Requirements, Architecture & Design Decisions
**Status:** Plan + review fold-in (tree at `aa64ad3`)
**Date:** 2026-05-16
**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs
resolved in-place, 4 CONCERNs folded, 4 NITs applied:**
R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.**
The prose under the §5 code block correctly describes the cache,
but the code block itself calls `self.tokenize_fn(t.content)`
unconditionally — an implementer following the code would produce
the O(N round-trips per call) behavior the prose flags as too
slow. **Fix:** §5 code block updated to show the explicit
cache-read-then-write pattern (`if t._tokens then ... else
t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row
also calls this out explicitly.
R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely
when system_prompt alone exceeds token_budget.** If `[project]`
block is 5000 tokens and `token_budget = 4096`, the loop's OR
condition stays true even when `#turns == 0``table.remove`
is a no-op, the loop never exits. **Fix:** §13 commit 3 row
updated to specify the explicit guard: `while (#self.turns >
self.max_turns or self:estimate_tokens() > self.token_budget)
and #self.turns > 0 do`. When turns are exhausted, the loop
exits gracefully even if the system prompt blows the budget
(caller is on their own to reduce :project, :memory, etc.).
R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically
undefined.** Sum-of-prompt_tokens across all calls (accumulator)
vs current-snapshot estimate are incommensurable — sessions
with evictions ALWAYS show divergence not because heuristic is
wrong but because they measure different things. **Resolution:**
§6 reworked to drop the per-slot `~est=N` inline annotation
(which conflated the two); instead show a SINGLE trailing
"[estimated session ctx: N tokens]" line under :cost detail.
Cleanly separates the running-total accumulator from the
current-snapshot estimate. §13 commit 4 already pointed to this
direction — now §6 matches.
R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg`
by upvalue, not by value-capture.** If the implementer writes
`local cfg = active_cfg; return function(text) ... cfg ... end`,
the closure won't follow `:model` switches. **Fix:** §13 commit
4 row gains an explicit code note: the closure MUST be
`function(text) return broker.token_count(active_cfg, text) end`
— direct upvalue reference. A5 analysis verified upvalue
semantics; now spelled out so the implementer doesn't subtly
miss it.
R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as
unsupported when llama.cpp is busy serving a concurrent
completion.** llama.cpp is single-threaded for inference; a
/tokenize request that arrives mid-generation queues behind
inference and may exceed the 2s cap. The capability would then
cache as `false` for the rest of the session, even though the
endpoint IS capable. **Fix:** §9 risk row added documenting
this. Mitigation: 2s is reasonable for IDLE responses but if
practical problems surface, bump to 5s or make configurable
(`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if
it bites.
R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two
same-endpoint/different-model presets.** B1 confirmed
/tokenize ignores the model field, so two probes per session
when one would suffice. **Fix:** §4 cache key SIMPLIFIED to
just `model_cfg.endpoint` (B1-justified). Same-endpoint
presets share one cache entry; one probe per endpoint per
session, not per (endpoint, model). For a future broker that
DOES honor the model field, this design choice would need
revisiting — documented inline.
R-N1..N4 (NITs, APPLIED):
N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected
to Lua's lowercase `or`/`and`.
N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
"Baseline (B1)" for consistency.
N3. §6 outdated inline `~est=N` description removed; new approach
(single trailing summary line) is documented; §8 out-of-scope
bullet about per-call comparison stays as the explicit "we
considered, rejected" record.
N4. PHASE8.md status header (formerly carrying a stale tree hash
that would drift before implementation) now references the
latest tree as of this fold-in (`aa64ad3`). Commit 5's status
bump to "Implement" will refresh it again at that point.
**Analyze findings (2026-05-16):**
A1. **enforce_budget ONLY checks max_turns, not token_budget — major
scope gap.** `Context:enforce_budget` (context.lua:319) iterates
`while #self.turns > self.max_turns`; `self.token_budget = 4096`
is set but NEVER consulted. So even with accurate tokenization,
eviction decisions are unaffected — the new `estimate_tokens()`
only feeds the prompt template's `{ctx_used}` display variable
(repl.lua:630).
**Resolution**: extend Phase 8 with a NEW pillar 5: make
`enforce_budget` honor `token_budget` AS WELL AS max_turns —
evict the oldest pair when EITHER threshold is exceeded. This
is the real motivation for accurate tokenization; without it
Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
§3 (context.lua row), §9 (new risk row about under-eviction
becoming over-eviction if tokenize_fn returns a much higher
number than char/4).
A2. **`ffi.curl.M.post` signature confirmed.** `(body, status)` on
success, `(nil, err)` on failure. Matches the formulate-time
sketch. status is the integer HTTP code. The probe checks
`status == 200 and out` correctly.
A3. **Single caller of `Context:estimate_tokens()` in tree.** Only
`repl.lua:630` (prompt template `{ctx_used}` substitution) calls
it. No internal callers in context.lua. This means:
- The wiring point is ONE line in repl.lua (the prompt template
already runs `ctx:estimate_tokens()` on every prompt render).
- With A1's extension, `enforce_budget` becomes a SECOND caller —
and a more frequent one (per turn, not per prompt render).
- Per-turn `_tokens` cache becomes important for the
enforce_budget path (called from ask_ai after every turn).
A4. **Q-T1 RESOLVED**: per-turn `_tokens` cache lives on the turn
dict. `:reset` clears `ctx.turns` so the cache dies with them.
New turns get nil `_tokens`; lazy-set on first count. Trivial.
A5. **Q-T2 RESOLVED**: tokenize_fn closure captures `active_cfg` as
an UPVALUE. Upvalues are resolved at closure call time, not at
definition time. When the user `:model cloud` switches,
`active_cfg = config.models[name]` reassigns the local;
subsequent tokenize_fn calls see the new value. Natural; no
explicit re-binding needed.
A6. **Q-T3 RESOLVED**: skip the probe entirely when
`cfg.tokenize.use_endpoint = false` or unset. Don't even call
`broker.token_count` — repl.lua won't wire `tokenize_fn` to
Context.new in the first place. context.lua's tokenize_fn-nil
branch handles it (char/4 fallback).
A7. **Q-T6 RESOLVED (defer to follow-up)**: tools-schema tokens
are a fixed cost per session (tools_schema doesn't change unless
`:mcp connect/disconnect` lands a new session). The under-count
is bounded and predictable. Defer to a future polish; v1
counts only messages. Document in §8 out-of-scope.
A8. **Per-turn `_tokens` cache invalidation.** Turn `content` is
immutable after append (we don't mutate stored turns). Cache is
safe to live forever on the turn. The only invalidation event
is `:reset` (clears turns wholesale). No other invalidation
needed.
A9. **Probe latency baseline** (Q-T4 deferred): probed manually
during formulate — single tokenize call for ~50 char text ran
in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms
= 2s ONLY on the first estimate after a fresh session. After
caching, subsequent estimates are O(1) per turn (dict lookups).
A10. **Streaming during `chat_stream` interleaves with tokenize?**
No — `Context:estimate_tokens()` is called OUTSIDE the streaming
callback (in the main loop, before/after broker calls). No
concurrent network competition.
A11. **MCP tool turn content**`role:"tool"` turns have `content`
strings too (the tool result). These get tokenized identically;
no special-case needed. Cache key is the turn dict itself, so
tool turns get their own `_tokens` slot.
A12. **`include_usage` interaction with tokenize**: orthogonal. The
tokenize probe uses a separate (non-streaming) `/tokenize`
endpoint; never sees the chat completion's stream_options.
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
specifies what Phase 8 adds — **accurate tokenization**: replace the
char/4 heuristic on `Context:estimate_tokens()` with a per-broker
`/tokenize` round-trip where supported, char/4 fallback otherwise.
Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
lands in the same commit as this formulate doc.
---
## 1. Scope of Phase 8
Five pillars (A1 added pillar 5):
1. **Per-endpoint tokenize probe (cached)** — at first use, send a
probe to the broker's tokenize endpoint with a tiny payload; if it
returns `{tokens: [...]}` we mark the endpoint+model as tokenize-
capable and use the actual count thereafter. If it 404s or errors,
mark the slot as `tokenize_supported = false` and fall through to
char/4 silently. Cached per `(endpoint, model)` for the session.
2. **`broker.token_count(model_cfg, text)`** — thin wrapper that
returns an accurate token count when the (endpoint, model) is
tokenize-capable, else the char/4 heuristic. Always returns a
non-negative integer; never errors. The probe + fallback is
transparent to callers.
3. **`Context:estimate_tokens()` widening** — currently char/4 over
`system_prompt` + sum of `turn.content`s. The new shape accepts
an optional `tokenize_fn` (callback) at `Context.new` time and uses
it when present; falls back to char/4 when nil. `repl.lua` wires
`tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`.
This means the active model's tokenizer is used for budgeting
decisions, which matches the broker the next ask_ai will hit.
4. **`:cost detail` estimated-vs-actual column** — for each
(model, category) slot in the accumulator, the actual
`prompt_tokens` from broker usage is already stored. Add an
estimated column computed via `broker.token_count` on the
currently-buffered prompt-shape. Disagreement >10% surfaces in a
tiny `~est=N` annotation so users can see when the heuristic
diverges from reality. Display-only; no behavior change.
5. **`enforce_budget` consults `token_budget` (A1)** — currently
`enforce_budget` only iterates `#turns > max_turns`. Extend to
ALSO check `estimate_tokens() > token_budget`. Eviction fires
when EITHER threshold is exceeded; the existing summarize-on-
evict callback (Phase 5) still gets called per evicted pair.
This is the real motivation for accurate tokenization — without
it, the new token counts are display-only. Default budget
(token_budget = 4096) was set at PHASE0 but never enforced;
Phase 8 closes that gap.
**Phase 8 is done when:**
- A long-running session with the local `qwen-coder-7b-snappy-8k`
model evicts at the RIGHT moment (token_budget=4096 hit triggers
eviction via the new pillar 5 path) rather than only when
max_turns is exceeded.
- `broker.token_count(local_cfg, "hello world")` returns 2 (matches
the live tokenize result, not the char/4=2 coincidence — verify
via `:cost detail` against multi-paragraph text).
- `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4
fallback when /tokenize 404s, which it does for OpenRouter).
- Cached per-endpoint capability — the probe fires once per
endpoint per session, not per call.
- Existing configs without `cfg.tokenize` behave like Phase 7 (zero
behavior change unless opted in via `cfg.tokenize.use_endpoint = true`).
- `:cost detail` shows estimated-vs-actual where disagreement >10%,
silent otherwise.
---
## 2. Technology Decisions (delta from Phase 7)
| Decision | Choice | Rationale |
|---|---|---|
| Tokenize endpoint path | `<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. |
| Request body shape | `{"content": "<text>", "model": "<model>"}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. |
| Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
| `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
| Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
| `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. |
| New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |
---
## 3. Module Changes
| File | State after Phase 7 | Phase 8 changes |
|---|---|---|
| `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents; `enforce_budget()` only checks max_turns | Widen `estimate_tokens` to use `self.tokenize_fn(text)` if present; else char/4. Per-turn `_tokens` cache on each turn dict; lazy-set on first count. Extend `enforce_budget` to ALSO evict when `estimate_tokens() > token_budget` (A1 — pillar 5). |
| `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
| `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
No new module files.
---
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
R6-revised — cache key is endpoint-only (B1: /tokenize ignores the
model field so two presets sharing an endpoint share one cache entry):
```lua
-- Per-endpoint capability cache (session-scoped local in broker.lua).
-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
-- model field; same endpoint -> same tokenization). If a future
-- broker honors the model field, revisit this keying.
local _tokenize_capable = {} -- [endpoint] = true | false
function M.token_count(model_cfg, text)
text = text or ""
if text == "" then return 0 end
if not (model_cfg and model_cfg.endpoint) then
return math.floor(#text / 4) -- pure fallback
end
local ep = model_cfg.endpoint
local cap = _tokenize_capable[ep]
if cap == false then
return math.floor(#text / 4)
end
-- cap == nil OR cap == true; try the endpoint.
local url = ep:gsub("/+$", "") .. "/tokenize"
local body = json.encode({ content = text, model = model_cfg.model })
local out, status = curl.post(url, body,
{ "Content-Type: application/json" },
2000) -- 2s timeout
if not (status == 200 and out) then
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
local doc = json.decode(out)
local toks = doc and doc.tokens
if type(toks) ~= "table" then
_tokenize_capable[ep] = false
return math.floor(#text / 4)
end
_tokenize_capable[ep] = true
return #toks
end
function M.tokenize_supported(model_cfg)
if not (model_cfg and model_cfg.endpoint) then return nil end
return _tokenize_capable[model_cfg.endpoint]
end
```
Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
---
## 5. Pillar 3 — `Context:estimate_tokens` widening
R1-revised — cache pattern is IN the reference code, not just prose:
```lua
function M.new(opts)
...
return setmetatable({
...
-- Phase 8: optional callback that returns an accurate token
-- count for a given text. Set by repl.lua when cfg.tokenize.
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
-- nil = char/4 fallback (Phase 0 §8 behavior).
tokenize_fn = opts.tokenize_fn,
}, Context)
end
function Context:estimate_tokens()
if self.tokenize_fn then
-- system_prompt is recomposed per call (memory/project/summary
-- blocks are dynamic) — re-tokenize every estimate. Bounded
-- by one round-trip.
local n = self.tokenize_fn(self.system_prompt)
-- R1: per-turn cache on the turn dict itself. Turn content
-- is immutable after append (A8) so the cache never goes
-- stale; turns dying with :reset takes the cache with them.
for _, t in ipairs(self.turns) do
if t._tokens == nil then
t._tokens = self.tokenize_fn(t.content)
end
n = n + t._tokens
end
return n
end
-- char/4 fallback (existing behavior)
local n = #self.system_prompt
for _, t in ipairs(self.turns) do n = n + #t.content end
return math.floor(n / 4)
end
```
Performance: first call after a fresh session fires N+1 round-trips
(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
+ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms
amortized per call — acceptable for the prompt-template render path
AND the per-step Norris enforce_budget call.
---
## 6. Pillar 4 — `:cost detail` est-vs-actual
Current `:cost detail` (Phase 7) shows:
```
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
```
The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all
calls in that slot — including any turns later evicted from context.
R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
would conflate two different things — the per-slot prompt_tokens is a
cumulative running total (across calls AND past evicted turns), while
`estimate_tokens()` is a current-snapshot measurement (in-memory turns
ONLY). Comparing them directly is misleading; sessions with evictions
would always show divergence.
Instead, add a SINGLE trailing summary line after the slot rows:
```
... per-slot rows ...
[estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
```
The estimate is `ctx:estimate_tokens()` over the current ctx (system
prompt + live turns); the percentage gives at-a-glance budget
utilization. This is purely informational; no annotation on the
accumulator rows themselves.
---
## 7. UX Surface Summary
| Meta | Behavior change |
|---|---|
| `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% |
| (no new metas in v1) | |
| Config | Default | Effect |
|---|---|---|
| `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts |
The `cfg.tokenize` block being opt-in is conservative: enabling it
means every `Context:estimate_tokens()` call may hit the broker. For
local llama.cpp the cost is ~50ms; for cloud-only configurations there
IS no /tokenize endpoint so we silently fall through to char/4 (cached
after one probe). No surprise; document in config example.
---
## 8. Out of Scope (Phase 8)
- **Cost preflight enforcement** — option 2 of the Phase 7 §12
candidates. The tokenize work here is a PREREQUISITE for accurate
preflight cost estimation, but the enforcement layer itself
(cap_at_dollars that REFUSES the call) is its own surface — defer
to a separate phase.
- **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates.
Independent of tokenization.
- **Streaming tokenize** — some servers expose streaming tokenize
endpoints for partial-prompt token counts during generation. Out
of scope here; we use the blocking /tokenize for batch estimates.
- **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat,
sentencepiece for HuggingFace) — would require vendoring a C library
(violates PHASE0 §3) or shelling out to python. Endpoint-based is
the only substrate-compliant option for accuracy beyond char/4.
- **Tokenization for `:cost detail` rows that span multiple turns**
— the actual `prompt_tokens` in the accumulator slot is the sum
ACROSS calls; the estimate for comparison should be over the
CURRENT ctx content. Show the per-call comparison only.
---
## 9. Risks
| Risk | Mitigation |
|---|---|
| `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. |
| Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
| Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
---
## 10. Open Questions (Phase 8)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-T1 | Per-turn `_tokens` cache across `:reset` | A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. |
| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
---
## 11. Phase 8 → Phase 9+ Out-of-band
Candidate follow-ups (non-binding):
- **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) —
uses Phase 8's accurate token counts to refuse calls that would
cross `cap_at_dollars`. The accuracy work here is the foundation.
- **Cross-session cost rollup** (Phase 7 §12 option 1) — independent;
could land in parallel.
- **Phase X**: project-local config overlay (`.aish.lua`) — was the
alternative scope to Phase 7's cost work. Still valuable but
independent of any current line.
Phase 8 itself is self-contained — no upstream dependencies.
---
## 13. Implementation Plan (commit-by-commit)
Bottom-up: broker first (the egress capability all callers depend
on), then context (the consumer + the new pillar 5 budget extension),
then repl.lua wiring + display, then config + status bump. Each
commit leaves the tree green (existing tests + load smoke + per-
commit feature smoke).
### Order
1. **`broker.lua``M.token_count` helper + per-endpoint capability cache.**
- Module-local `_tokenize_capable` table keyed by `endpoint .. "/" .. model`.
- `M.token_count(model_cfg, text)`:
- empty text -> 0
- bad cfg (no endpoint) -> char/4 immediately
- capability cache says `false` for this slot -> char/4
- otherwise: probe `<endpoint>/tokenize` with `{content, model}` body,
2s timeout. On `status == 200 + parseable {tokens=[...]}`:
cache `true`, return `#tokens`. Anything else (non-200, parse
fail, transport err): cache `false`, char/4.
- `M.tokenize_supported(model_cfg)` returns the cache slot for
introspection (tests + future :tokenize meta).
- Smoke: hand-call `M.token_count(local_cfg, "hello world")` -> 2;
`M.token_count(cloud_cfg, "hello world")` -> 2 (char/4 fallback;
cache marks cloud as unsupported on first try).
2. **`context.lua` — estimate_tokens widening + per-turn cache.**
- Context.new accepts `opts.tokenize_fn` -> stored as `self.tokenize_fn`.
- `Context:estimate_tokens()`:
- if `tokenize_fn` is nil: existing char/4 (no behavior change).
- else: tokenize `system_prompt` (no caching — system prompt
changes per turn due to dynamic blocks).
For each turn: if `turn._tokens` is set use it; else
compute via tokenize_fn AND cache on turn._tokens.
- No new helper; the change is internal to estimate_tokens.
- Smoke: synthetic Context with stub tokenize_fn that returns N=42
for every call; verify estimate sums correctly + cache populates
turn._tokens.
3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
- Existing `while #self.turns > self.max_turns` loop extended.
**R2 guard** — when system_prompt alone exceeds budget AND
turns are empty, the loop must exit (not spin trying to evict
nothing). Correct condition:
```lua
while (#self.turns > self.max_turns
or self:estimate_tokens() > self.token_budget)
and #self.turns > 0 do
```
Lowercase `or`/`and` per Lua syntax (N1).
- Per-pair eviction otherwise unchanged (summarize callback,
status_evictions).
- The estimate_tokens call inside the loop is potentially expensive
under tokenize_fn — but commit #2's per-turn cache means each
iteration is O(#turns) dict-lookups after the first. Acceptable
for the eviction hot path.
- Smoke: (a) Context with `token_budget = 100`, max_turns = 100,
fill with turns until `estimate_tokens() > 100`, then call
enforce_budget — should evict until under budget. (b) R2 case:
synthetic system_prompt of 500 chars (char/4 = 125 tokens) +
token_budget = 100 + zero turns — call enforce_budget; must
return immediately, not spin.
4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.**
- When `config.tokenize and config.tokenize.use_endpoint`, build
`ctx_opts.tokenize_fn = function(text)
return broker.token_count(active_cfg, text)
end`. **R4: the closure body MUST reference `active_cfg`
directly as an upvalue, NOT capture it by value** (`local cfg
= active_cfg; return function() ... cfg ... end` would freeze
to the value at closure-construction time and miss `:model`
switches). A5 verified upvalue semantics in Lua.
- `:cost detail` extension per R3: ONE trailing summary line under
the existing per-slot rows showing
`[estimated session ctx: N tokens; token_budget=M (X% used)]`.
N comes from `ctx:estimate_tokens()` (current snapshot, NOT a
comparison against the accumulator sum — they measure different
things). M is `ctx.token_budget`. X% = N/M × 100.
- Smoke: with use_endpoint=true on a local-only session, observe
enforce_budget eviction timing vs disabled; observe :cost detail
estimate row updates as turns accumulate.
5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
- Commented-out `tokenize = { use_endpoint = true }` block in
config.lua with parity to Phase 1-7 example blocks. Document
the per-endpoint network cost (one probe per session) and the
implication: token_budget actually enforces now.
- PHASE8.md status header -> **Implement**.
### Risk index per commit
| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
| 2 (context) | per-turn `_tokens` cache miss on every estimate when no tokenize_fn -> existing perf preserved | Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES — sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
| 4 (repl) | tokenize_fn closure binding to `active_cfg` upval — if upval somehow gets reassigned wrong, eviction uses wrong tokenizer | Lua upvalues are call-time-resolved; A5 verified. Test by smoke after `:model` switch. |
| 5 (config + status) | none | |
### Tests + smoke per commit
Each commit:
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
- Pass a per-feature smoke (described in each row above)
### Things deliberately NOT split
- New module file for tokenize — small enough to live in broker.lua.
- Per-text token cache (in addition to per-turn): not needed; turn
content is immutable post-append.
- :tokenize meta for introspecting the cache — `M.tokenize_supported`
is exported for testing; if a user needs runtime visibility, that's
a follow-up.
### Open at plan-time (resolve at implement)
- :cost detail layout — how exactly to show "estimated session ctx"
relative to the existing per-slot rows. Pick at commit 4 (likely
a single trailing line under the detail table).
- Whether to expose `:tokenize <text>` for direct-probe debugging.
Nice-to-have; defer unless useful during verify.