08dba69fce
config.lua:
- Commented-out `tokenize = { use_endpoint = true }` block with
parity to the Phase 1-7 example blocks.
- Documents the two consequences: (1) per-turn network cost
(~30ms first time, cached after) and (2) token_budget is now
actually enforced — sessions that fit under char/4 may evict
earlier under accurate counts.
- Notes cloud /tokenize 404 fallback path.
docs/PHASE8.md:
- Status header bumped: "Plan + review fold-in" -> "Implement"
- Lists the 5 implement commits inline for traceability:
7ef2a6e broker: token_count + endpoint cache
8502517 context: tokenize_fn + _tokens cache
db26d0c context: enforce_budget honors token_budget (R2 guard)
94b7d86 repl: wire tokenize_fn + :cost detail estimate row
this config example + status bump
Phase 8 implementation is complete. Resolves Q1 (PHASE0 §13,
originally Phase 3, deferred forward). Next inner-loop step: verify
(7) — file test cases, run autonomous, close. Then memory-update (8).
Regression: test_safety 87/87, test_router_model 31/31, repl loads.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
623 lines
33 KiB
Markdown
623 lines
33 KiB
Markdown
# aish — Phase 8 Manifest
|
||
|
||
**Project:** aish — AI-augmented conversational shell
|
||
**Document:** Phase 8 Requirements, Architecture & Design Decisions
|
||
**Status:** Implement (5 commits landed: 7ef2a6e, 8502517, db26d0c, 94b7d86, this)
|
||
**Date:** 2026-05-16
|
||
|
||
**Review findings (independent Sonnet agent, 2026-05-16) — 2 BLOCKERs
|
||
resolved in-place, 4 CONCERNs folded, 4 NITs applied:**
|
||
|
||
R1 (BLOCKER, RESOLVED). **§5 pseudocode missing per-turn cache pattern.**
|
||
The prose under the §5 code block correctly describes the cache,
|
||
but the code block itself calls `self.tokenize_fn(t.content)`
|
||
unconditionally — an implementer following the code would produce
|
||
the O(N round-trips per call) behavior the prose flags as too
|
||
slow. **Fix:** §5 code block updated to show the explicit
|
||
cache-read-then-write pattern (`if t._tokens then ... else
|
||
t._tokens = self.tokenize_fn(t.content) end`). §13 commit 2 row
|
||
also calls this out explicitly.
|
||
|
||
R2 (BLOCKER, RESOLVED). **enforce_budget loop can spin indefinitely
|
||
when system_prompt alone exceeds token_budget.** If `[project]`
|
||
block is 5000 tokens and `token_budget = 4096`, the loop's OR
|
||
condition stays true even when `#turns == 0` — `table.remove`
|
||
is a no-op, the loop never exits. **Fix:** §13 commit 3 row
|
||
updated to specify the explicit guard: `while (#self.turns >
|
||
self.max_turns or self:estimate_tokens() > self.token_budget)
|
||
and #self.turns > 0 do`. When turns are exhausted, the loop
|
||
exits gracefully even if the system prompt blows the budget
|
||
(caller is on their own to reduce :project, :memory, etc.).
|
||
|
||
R3 (CONCERN, FOLDED). **`:cost detail` comparison semantically
|
||
undefined.** Sum-of-prompt_tokens across all calls (accumulator)
|
||
vs current-snapshot estimate are incommensurable — sessions
|
||
with evictions ALWAYS show divergence not because heuristic is
|
||
wrong but because they measure different things. **Resolution:**
|
||
§6 reworked to drop the per-slot `~est=N` inline annotation
|
||
(which conflated the two); instead show a SINGLE trailing
|
||
"[estimated session ctx: N tokens]" line under :cost detail.
|
||
Cleanly separates the running-total accumulator from the
|
||
current-snapshot estimate. §13 commit 4 already pointed to this
|
||
direction — now §6 matches.
|
||
|
||
R4 (CONCERN, FOLDED). **tokenize_fn closure must reference `active_cfg`
|
||
by upvalue, not by value-capture.** If the implementer writes
|
||
`local cfg = active_cfg; return function(text) ... cfg ... end`,
|
||
the closure won't follow `:model` switches. **Fix:** §13 commit
|
||
4 row gains an explicit code note: the closure MUST be
|
||
`function(text) return broker.token_count(active_cfg, text) end`
|
||
— direct upvalue reference. A5 analysis verified upvalue
|
||
semantics; now spelled out so the implementer doesn't subtly
|
||
miss it.
|
||
|
||
R5 (CONCERN, FOLDED). **2s tokenize timeout can spuriously cache as
|
||
unsupported when llama.cpp is busy serving a concurrent
|
||
completion.** llama.cpp is single-threaded for inference; a
|
||
/tokenize request that arrives mid-generation queues behind
|
||
inference and may exceed the 2s cap. The capability would then
|
||
cache as `false` for the rest of the session, even though the
|
||
endpoint IS capable. **Fix:** §9 risk row added documenting
|
||
this. Mitigation: 2s is reasonable for IDLE responses but if
|
||
practical problems surface, bump to 5s or make configurable
|
||
(`cfg.tokenize.timeout_ms`). v1 ships 2s; revisit in verify if
|
||
it bites.
|
||
|
||
R6 (CONCERN, FOLDED). **Per-endpoint cache key conflates two
|
||
same-endpoint/different-model presets.** B1 confirmed
|
||
/tokenize ignores the model field, so two probes per session
|
||
when one would suffice. **Fix:** §4 cache key SIMPLIFIED to
|
||
just `model_cfg.endpoint` (B1-justified). Same-endpoint
|
||
presets share one cache entry; one probe per endpoint per
|
||
session, not per (endpoint, model). For a future broker that
|
||
DOES honor the model field, this design choice would need
|
||
revisiting — documented inline.
|
||
|
||
R-N1..N4 (NITs, APPLIED):
|
||
N1. §13 commit 3 condition uses uppercase `OR`/`AND` — corrected
|
||
to Lua's lowercase `or`/`and`.
|
||
N2. §10 Q-T5 row's "Resolution target" cell was empty; now reads
|
||
"Baseline (B1)" for consistency.
|
||
N3. §6 outdated inline `~est=N` description removed; new approach
|
||
(single trailing summary line) is documented; §8 out-of-scope
|
||
bullet about per-call comparison stays as the explicit "we
|
||
considered, rejected" record.
|
||
N4. PHASE8.md status header (formerly carrying a stale tree hash
|
||
that would drift before implementation) now references the
|
||
latest tree as of this fold-in (`aa64ad3`). Commit 5's status
|
||
bump to "Implement" will refresh it again at that point.
|
||
|
||
**Analyze findings (2026-05-16):**
|
||
|
||
A1. **enforce_budget ONLY checks max_turns, not token_budget — major
|
||
scope gap.** `Context:enforce_budget` (context.lua:319) iterates
|
||
`while #self.turns > self.max_turns`; `self.token_budget = 4096`
|
||
is set but NEVER consulted. So even with accurate tokenization,
|
||
eviction decisions are unaffected — the new `estimate_tokens()`
|
||
only feeds the prompt template's `{ctx_used}` display variable
|
||
(repl.lua:630).
|
||
|
||
**Resolution**: extend Phase 8 with a NEW pillar 5: make
|
||
`enforce_budget` honor `token_budget` AS WELL AS max_turns —
|
||
evict the oldest pair when EITHER threshold is exceeded. This
|
||
is the real motivation for accurate tokenization; without it
|
||
Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
|
||
§3 (context.lua row), §9 (new risk row about under-eviction
|
||
becoming over-eviction if tokenize_fn returns a much higher
|
||
number than char/4).
|
||
|
||
A2. **`ffi.curl.M.post` signature confirmed.** `(body, status)` on
|
||
success, `(nil, err)` on failure. Matches the formulate-time
|
||
sketch. status is the integer HTTP code. The probe checks
|
||
`status == 200 and out` correctly.
|
||
|
||
A3. **Single caller of `Context:estimate_tokens()` in tree.** Only
|
||
`repl.lua:630` (prompt template `{ctx_used}` substitution) calls
|
||
it. No internal callers in context.lua. This means:
|
||
- The wiring point is ONE line in repl.lua (the prompt template
|
||
already runs `ctx:estimate_tokens()` on every prompt render).
|
||
- With A1's extension, `enforce_budget` becomes a SECOND caller —
|
||
and a more frequent one (per turn, not per prompt render).
|
||
- Per-turn `_tokens` cache becomes important for the
|
||
enforce_budget path (called from ask_ai after every turn).
|
||
|
||
A4. **Q-T1 RESOLVED**: per-turn `_tokens` cache lives on the turn
|
||
dict. `:reset` clears `ctx.turns` so the cache dies with them.
|
||
New turns get nil `_tokens`; lazy-set on first count. Trivial.
|
||
|
||
A5. **Q-T2 RESOLVED**: tokenize_fn closure captures `active_cfg` as
|
||
an UPVALUE. Upvalues are resolved at closure call time, not at
|
||
definition time. When the user `:model cloud` switches,
|
||
`active_cfg = config.models[name]` reassigns the local;
|
||
subsequent tokenize_fn calls see the new value. Natural; no
|
||
explicit re-binding needed.
|
||
|
||
A6. **Q-T3 RESOLVED**: skip the probe entirely when
|
||
`cfg.tokenize.use_endpoint = false` or unset. Don't even call
|
||
`broker.token_count` — repl.lua won't wire `tokenize_fn` to
|
||
Context.new in the first place. context.lua's tokenize_fn-nil
|
||
branch handles it (char/4 fallback).
|
||
|
||
A7. **Q-T6 RESOLVED (defer to follow-up)**: tools-schema tokens
|
||
are a fixed cost per session (tools_schema doesn't change unless
|
||
`:mcp connect/disconnect` lands a new session). The under-count
|
||
is bounded and predictable. Defer to a future polish; v1
|
||
counts only messages. Document in §8 out-of-scope.
|
||
|
||
A8. **Per-turn `_tokens` cache invalidation.** Turn `content` is
|
||
immutable after append (we don't mutate stored turns). Cache is
|
||
safe to live forever on the turn. The only invalidation event
|
||
is `:reset` (clears turns wholesale). No other invalidation
|
||
needed.
|
||
|
||
A9. **Probe latency baseline** (Q-T4 deferred): probed manually
|
||
during formulate — single tokenize call for ~50 char text ran
|
||
in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms
|
||
= 2s ONLY on the first estimate after a fresh session. After
|
||
caching, subsequent estimates are O(1) per turn (dict lookups).
|
||
|
||
A10. **Streaming during `chat_stream` interleaves with tokenize?**
|
||
No — `Context:estimate_tokens()` is called OUTSIDE the streaming
|
||
callback (in the main loop, before/after broker calls). No
|
||
concurrent network competition.
|
||
|
||
A11. **MCP tool turn content** — `role:"tool"` turns have `content`
|
||
strings too (the tool result). These get tokenized identically;
|
||
no special-case needed. Cache key is the turn dict itself, so
|
||
tool turns get their own `_tokens` slot.
|
||
|
||
A12. **`include_usage` interaction with tokenize**: orthogonal. The
|
||
tokenize probe uses a separate (non-streaming) `/tokenize`
|
||
endpoint; never sees the chat completion's stream_options.
|
||
|
||
PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
|
||
specifies what Phase 8 adds — **accurate tokenization**: replace the
|
||
char/4 heuristic on `Context:estimate_tokens()` with a per-broker
|
||
`/tokenize` round-trip where supported, char/4 fallback otherwise.
|
||
|
||
Resolves Q1 (`PHASE0.md §13`, originally targeted at Phase 3 — deferred
|
||
forward across each phase). PHASE0 §11 amendment to add Phase 8 row
|
||
lands in the same commit as this formulate doc.
|
||
|
||
---
|
||
|
||
## 1. Scope of Phase 8
|
||
|
||
Five pillars (A1 added pillar 5):
|
||
|
||
1. **Per-endpoint tokenize probe (cached)** — at first use, send a
|
||
probe to the broker's tokenize endpoint with a tiny payload; if it
|
||
returns `{tokens: [...]}` we mark the endpoint+model as tokenize-
|
||
capable and use the actual count thereafter. If it 404s or errors,
|
||
mark the slot as `tokenize_supported = false` and fall through to
|
||
char/4 silently. Cached per `(endpoint, model)` for the session.
|
||
|
||
2. **`broker.token_count(model_cfg, text)`** — thin wrapper that
|
||
returns an accurate token count when the (endpoint, model) is
|
||
tokenize-capable, else the char/4 heuristic. Always returns a
|
||
non-negative integer; never errors. The probe + fallback is
|
||
transparent to callers.
|
||
|
||
3. **`Context:estimate_tokens()` widening** — currently char/4 over
|
||
`system_prompt` + sum of `turn.content`s. The new shape accepts
|
||
an optional `tokenize_fn` (callback) at `Context.new` time and uses
|
||
it when present; falls back to char/4 when nil. `repl.lua` wires
|
||
`tokenize_fn = function(text) return broker.token_count(active_cfg, text) end`.
|
||
This means the active model's tokenizer is used for budgeting
|
||
decisions, which matches the broker the next ask_ai will hit.
|
||
|
||
4. **`:cost detail` estimated-vs-actual column** — for each
|
||
(model, category) slot in the accumulator, the actual
|
||
`prompt_tokens` from broker usage is already stored. Add an
|
||
estimated column computed via `broker.token_count` on the
|
||
currently-buffered prompt-shape. Disagreement >10% surfaces in a
|
||
tiny `~est=N` annotation so users can see when the heuristic
|
||
diverges from reality. Display-only; no behavior change.
|
||
|
||
5. **`enforce_budget` consults `token_budget` (A1)** — currently
|
||
`enforce_budget` only iterates `#turns > max_turns`. Extend to
|
||
ALSO check `estimate_tokens() > token_budget`. Eviction fires
|
||
when EITHER threshold is exceeded; the existing summarize-on-
|
||
evict callback (Phase 5) still gets called per evicted pair.
|
||
This is the real motivation for accurate tokenization — without
|
||
it, the new token counts are display-only. Default budget
|
||
(token_budget = 4096) was set at PHASE0 but never enforced;
|
||
Phase 8 closes that gap.
|
||
|
||
**Phase 8 is done when:**
|
||
|
||
- A long-running session with the local `qwen-coder-7b-snappy-8k`
|
||
model evicts at the RIGHT moment (token_budget=4096 hit triggers
|
||
eviction via the new pillar 5 path) rather than only when
|
||
max_turns is exceeded.
|
||
- `broker.token_count(local_cfg, "hello world")` returns 2 (matches
|
||
the live tokenize result, not the char/4=2 coincidence — verify
|
||
via `:cost detail` against multi-paragraph text).
|
||
- `broker.token_count(cloud_cfg, "hello world")` returns 2 (char/4
|
||
fallback when /tokenize 404s, which it does for OpenRouter).
|
||
- Cached per-endpoint capability — the probe fires once per
|
||
endpoint per session, not per call.
|
||
- Existing configs without `cfg.tokenize` behave like Phase 7 (zero
|
||
behavior change unless opted in via `cfg.tokenize.use_endpoint = true`).
|
||
- `:cost detail` shows estimated-vs-actual where disagreement >10%,
|
||
silent otherwise.
|
||
|
||
---
|
||
|
||
## 2. Technology Decisions (delta from Phase 7)
|
||
|
||
| Decision | Choice | Rationale |
|
||
|---|---|---|
|
||
| Tokenize endpoint path | `<endpoint>/tokenize` (NOT `<endpoint>/v1/tokenize`) | Per real probe against hossenfelder: `/v1/tokenize` returns 404; `/tokenize` returns `{tokens: [...]}`. This is the llama.cpp server convention. |
|
||
| Request body shape | `{"content": "<text>", "model": "<model>"}` | Local model echoed via `model`; llama.cpp ignores it but harmless. Probed shape works. |
|
||
| Capability detection | Per-call optimistic probe; on 404/non-200, cache `tokenize_supported[endpoint][model] = false` and never retry that session | One round-trip cost on first miss; zero on subsequent. Sessions are short enough that re-probe across restarts is fine. |
|
||
| Fallback heuristic | char/4 (Phase 0 §8 convention) | Established; underestimates ~10% on real code/prose per baseline B1, but acceptable when no better signal available. |
|
||
| `Context:estimate_tokens` calling convention | Optional `tokenize_fn` callback at Context.new; absent = char/4 (existing behavior) | Backward-compatible; no caller break. Opt-in via repl.lua. |
|
||
| Active-model tokenizer | repl.lua wires `tokenize_fn` against `active_cfg` (the currently active model), so eviction decisions match the broker the next call will hit | When the user `:model cloud` switches mid-session, subsequent estimates use cloud's tokenizer (which falls back to char/4 since OpenRouter has no /tokenize). |
|
||
| Caching strategy | Endpoint+model capability flag only; NOT per-text token-count cache | Token counts depend on text content; caching adds memory + correctness risk for marginal speed. Probe latency dominates only on first call per endpoint. |
|
||
| Per-text timeout cap | 2s for tokenize calls (much tighter than the model's normal timeout_ms) | Tokenize is a small, fast operation; if it doesn't respond in 2s, the endpoint is misbehaving. Bail to char/4. |
|
||
| `:cost detail` est-vs-actual | Show only when disagreement >10%; format `(prompt: 558 ~est=508 / completion: 80)` for the disagreement case, `(prompt: 558 / completion: 80)` otherwise | Always-on noise; suppress when heuristic is close. |
|
||
| New config key | `cfg.tokenize = { use_endpoint = true }` — default false until user opts in | Network round-trip cost; user-acknowledged behavior change. |
|
||
|
||
---
|
||
|
||
## 3. Module Changes
|
||
|
||
| File | State after Phase 7 | Phase 8 changes |
|
||
|---|---|---|
|
||
| `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
|
||
| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents; `enforce_budget()` only checks max_turns | Widen `estimate_tokens` to use `self.tokenize_fn(text)` if present; else char/4. Per-turn `_tokens` cache on each turn dict; lazy-set on first count. Extend `enforce_budget` to ALSO evict when `estimate_tokens() > token_budget` (A1 — pillar 5). |
|
||
| `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
|
||
| `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
|
||
| `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
|
||
|
||
No new module files.
|
||
|
||
---
|
||
|
||
## 4. Pillar 1+2 — `broker.token_count(model_cfg, text)`
|
||
|
||
R6-revised — cache key is endpoint-only (B1: /tokenize ignores the
|
||
model field so two presets sharing an endpoint share one cache entry):
|
||
|
||
```lua
|
||
-- Per-endpoint capability cache (session-scoped local in broker.lua).
|
||
-- Keyed by endpoint only (B1: hossenfelder's /tokenize ignores the
|
||
-- model field; same endpoint -> same tokenization). If a future
|
||
-- broker honors the model field, revisit this keying.
|
||
local _tokenize_capable = {} -- [endpoint] = true | false
|
||
|
||
function M.token_count(model_cfg, text)
|
||
text = text or ""
|
||
if text == "" then return 0 end
|
||
if not (model_cfg and model_cfg.endpoint) then
|
||
return math.floor(#text / 4) -- pure fallback
|
||
end
|
||
local ep = model_cfg.endpoint
|
||
local cap = _tokenize_capable[ep]
|
||
if cap == false then
|
||
return math.floor(#text / 4)
|
||
end
|
||
-- cap == nil OR cap == true; try the endpoint.
|
||
local url = ep:gsub("/+$", "") .. "/tokenize"
|
||
local body = json.encode({ content = text, model = model_cfg.model })
|
||
local out, status = curl.post(url, body,
|
||
{ "Content-Type: application/json" },
|
||
2000) -- 2s timeout
|
||
if not (status == 200 and out) then
|
||
_tokenize_capable[ep] = false
|
||
return math.floor(#text / 4)
|
||
end
|
||
local doc = json.decode(out)
|
||
local toks = doc and doc.tokens
|
||
if type(toks) ~= "table" then
|
||
_tokenize_capable[ep] = false
|
||
return math.floor(#text / 4)
|
||
end
|
||
_tokenize_capable[ep] = true
|
||
return #toks
|
||
end
|
||
|
||
function M.tokenize_supported(model_cfg)
|
||
if not (model_cfg and model_cfg.endpoint) then return nil end
|
||
return _tokenize_capable[model_cfg.endpoint]
|
||
end
|
||
```
|
||
|
||
Uses Phase 1's `ffi/curl.M.post` (blocking POST, returns body + status).
|
||
|
||
---
|
||
|
||
## 5. Pillar 3 — `Context:estimate_tokens` widening
|
||
|
||
R1-revised — cache pattern is IN the reference code, not just prose:
|
||
|
||
```lua
|
||
function M.new(opts)
|
||
...
|
||
return setmetatable({
|
||
...
|
||
-- Phase 8: optional callback that returns an accurate token
|
||
-- count for a given text. Set by repl.lua when cfg.tokenize.
|
||
-- use_endpoint=true, calling broker.token_count(active_cfg, ...).
|
||
-- nil = char/4 fallback (Phase 0 §8 behavior).
|
||
tokenize_fn = opts.tokenize_fn,
|
||
}, Context)
|
||
end
|
||
|
||
function Context:estimate_tokens()
|
||
if self.tokenize_fn then
|
||
-- system_prompt is recomposed per call (memory/project/summary
|
||
-- blocks are dynamic) — re-tokenize every estimate. Bounded
|
||
-- by one round-trip.
|
||
local n = self.tokenize_fn(self.system_prompt)
|
||
-- R1: per-turn cache on the turn dict itself. Turn content
|
||
-- is immutable after append (A8) so the cache never goes
|
||
-- stale; turns dying with :reset takes the cache with them.
|
||
for _, t in ipairs(self.turns) do
|
||
if t._tokens == nil then
|
||
t._tokens = self.tokenize_fn(t.content)
|
||
end
|
||
n = n + t._tokens
|
||
end
|
||
return n
|
||
end
|
||
-- char/4 fallback (existing behavior)
|
||
local n = #self.system_prompt
|
||
for _, t in ipairs(self.turns) do n = n + #t.content end
|
||
return math.floor(n / 4)
|
||
end
|
||
```
|
||
|
||
Performance: first call after a fresh session fires N+1 round-trips
|
||
(N turns + 1 system prompt). Subsequent calls fire 1 (system prompt)
|
||
+ N dict lookups. For N=40, that's 40 × ~30ms = 1.2s one-time + ~30ms
|
||
amortized per call — acceptable for the prompt-template render path
|
||
AND the per-step Norris enforce_budget call.
|
||
|
||
---
|
||
|
||
## 6. Pillar 4 — `:cost detail` est-vs-actual
|
||
|
||
Current `:cost detail` (Phase 7) shows:
|
||
|
||
```
|
||
anthropic/claude-haiku-4.5 main 1 calls, 179 / 8 tokens, $0.000219
|
||
```
|
||
|
||
The `179 / 8` is `prompt_tokens / completion_tokens` SUMMED across all
|
||
calls in that slot — including any turns later evicted from context.
|
||
|
||
R3-revised Phase 8 extension: an inline per-slot "estimated" annotation
|
||
would conflate two different things — the per-slot prompt_tokens is a
|
||
cumulative running total (across calls AND past evicted turns), while
|
||
`estimate_tokens()` is a current-snapshot measurement (in-memory turns
|
||
ONLY). Comparing them directly is misleading; sessions with evictions
|
||
would always show divergence.
|
||
|
||
Instead, add a SINGLE trailing summary line after the slot rows:
|
||
|
||
```
|
||
... per-slot rows ...
|
||
[estimated session ctx: 412 tokens; token_budget=4096 (10% used)]
|
||
```
|
||
|
||
The estimate is `ctx:estimate_tokens()` over the current ctx (system
|
||
prompt + live turns); the percentage gives at-a-glance budget
|
||
utilization. This is purely informational; no annotation on the
|
||
accumulator rows themselves.
|
||
|
||
---
|
||
|
||
## 7. UX Surface Summary
|
||
|
||
| Meta | Behavior change |
|
||
|---|---|
|
||
| `:cost detail` | Adds `~est=N` annotation per slot when heuristic disagreement >10% |
|
||
| (no new metas in v1) | |
|
||
|
||
| Config | Default | Effect |
|
||
|---|---|---|
|
||
| `cfg.tokenize.use_endpoint` | false | When true, repl.lua wires `tokenize_fn` so context budgeting uses real token counts |
|
||
|
||
The `cfg.tokenize` block being opt-in is conservative: enabling it
|
||
means every `Context:estimate_tokens()` call may hit the broker. For
|
||
local llama.cpp the cost is ~50ms; for cloud-only configurations there
|
||
IS no /tokenize endpoint so we silently fall through to char/4 (cached
|
||
after one probe). No surprise; document in config example.
|
||
|
||
---
|
||
|
||
## 8. Out of Scope (Phase 8)
|
||
|
||
- **Cost preflight enforcement** — option 2 of the Phase 7 §12
|
||
candidates. The tokenize work here is a PREREQUISITE for accurate
|
||
preflight cost estimation, but the enforcement layer itself
|
||
(cap_at_dollars that REFUSES the call) is its own surface — defer
|
||
to a separate phase.
|
||
- **Cross-session cost rollup** — option 1 of Phase 7 §12 candidates.
|
||
Independent of tokenization.
|
||
- **Streaming tokenize** — some servers expose streaming tokenize
|
||
endpoints for partial-prompt token counts during generation. Out
|
||
of scope here; we use the blocking /tokenize for batch estimates.
|
||
- **Multi-tokenizer support** (e.g. tiktoken for OpenAI compat,
|
||
sentencepiece for HuggingFace) — would require vendoring a C library
|
||
(violates PHASE0 §3) or shelling out to python. Endpoint-based is
|
||
the only substrate-compliant option for accuracy beyond char/4.
|
||
- **Tokenization for `:cost detail` rows that span multiple turns**
|
||
— the actual `prompt_tokens` in the accumulator slot is the sum
|
||
ACROSS calls; the estimate for comparison should be over the
|
||
CURRENT ctx content. Show the per-call comparison only.
|
||
|
||
---
|
||
|
||
## 9. Risks
|
||
|
||
| Risk | Mitigation |
|
||
|---|---|
|
||
| `/tokenize` 404 silently cached as `tokenize_supported = false` for a typo'd endpoint config | Per-session cache; restart re-probes. Acceptable. |
|
||
| Tokenize round-trip on every prompt eviction check adds 50ms × N turns latency | `turn._tokens` per-turn cache set at append-time; only re-tokenize on cache miss. |
|
||
| Hossenfelder proxy may forward `/tokenize` differently than direct llama.cpp (e.g., adds `/v1/` prefix expected) | B1 confirms `/tokenize` works against hossenfelder; other proxies untested but the design degrades gracefully (char/4 fallback). |
|
||
| Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
|
||
| `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
|
||
| Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
|
||
| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |
|
||
| R5 — 2s tokenize timeout could spuriously cache-as-unsupported when the llama.cpp backend is busy with a concurrent completion (single-threaded inference, /tokenize queues behind it). Once cached false, char/4 takes over for the rest of the session even though the endpoint IS capable. | 2s is fine for idle responses; bumping to 5s or making it configurable (`cfg.tokenize.timeout_ms`) is a v1.1 polish if it bites in practice. Documented; revisit during verify. |
|
||
|
||
---
|
||
|
||
## 10. Open Questions (Phase 8)
|
||
|
||
| # | Question | Impact | Resolution target |
|
||
|---|---|---|---|
|
||
| Q-T1 | Per-turn `_tokens` cache across `:reset` | A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. |
|
||
| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
|
||
| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
|
||
| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
|
||
| Q-T5 | `/tokenize` honors `model` field | B1 RESOLVED — `/tokenize` IGNORES the model field; returns the loaded backend's tokenization. Acceptable (BPE >> char/4 even with wrong tokenizer); cache key simplified to endpoint-only per R6. |
|
||
| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |
|
||
|
||
---
|
||
|
||
## 11. Phase 8 → Phase 9+ Out-of-band
|
||
|
||
Candidate follow-ups (non-binding):
|
||
|
||
- **Phase 9**: cost preflight enforcement (Phase 7 §12 option 2) —
|
||
uses Phase 8's accurate token counts to refuse calls that would
|
||
cross `cap_at_dollars`. The accuracy work here is the foundation.
|
||
- **Cross-session cost rollup** (Phase 7 §12 option 1) — independent;
|
||
could land in parallel.
|
||
- **Phase X**: project-local config overlay (`.aish.lua`) — was the
|
||
alternative scope to Phase 7's cost work. Still valuable but
|
||
independent of any current line.
|
||
|
||
Phase 8 itself is self-contained — no upstream dependencies.
|
||
|
||
---
|
||
|
||
## 13. Implementation Plan (commit-by-commit)
|
||
|
||
Bottom-up: broker first (the egress capability all callers depend
|
||
on), then context (the consumer + the new pillar 5 budget extension),
|
||
then repl.lua wiring + display, then config + status bump. Each
|
||
commit leaves the tree green (existing tests + load smoke + per-
|
||
commit feature smoke).
|
||
|
||
### Order
|
||
|
||
1. **`broker.lua` — `M.token_count` helper + per-endpoint capability cache.**
|
||
- Module-local `_tokenize_capable` table keyed by `endpoint .. "/" .. model`.
|
||
- `M.token_count(model_cfg, text)`:
|
||
- empty text -> 0
|
||
- bad cfg (no endpoint) -> char/4 immediately
|
||
- capability cache says `false` for this slot -> char/4
|
||
- otherwise: probe `<endpoint>/tokenize` with `{content, model}` body,
|
||
2s timeout. On `status == 200 + parseable {tokens=[...]}`:
|
||
cache `true`, return `#tokens`. Anything else (non-200, parse
|
||
fail, transport err): cache `false`, char/4.
|
||
- `M.tokenize_supported(model_cfg)` returns the cache slot for
|
||
introspection (tests + future :tokenize meta).
|
||
- Smoke: hand-call `M.token_count(local_cfg, "hello world")` -> 2;
|
||
`M.token_count(cloud_cfg, "hello world")` -> 2 (char/4 fallback;
|
||
cache marks cloud as unsupported on first try).
|
||
|
||
2. **`context.lua` — estimate_tokens widening + per-turn cache.**
|
||
- Context.new accepts `opts.tokenize_fn` -> stored as `self.tokenize_fn`.
|
||
- `Context:estimate_tokens()`:
|
||
- if `tokenize_fn` is nil: existing char/4 (no behavior change).
|
||
- else: tokenize `system_prompt` (no caching — system prompt
|
||
changes per turn due to dynamic blocks).
|
||
For each turn: if `turn._tokens` is set use it; else
|
||
compute via tokenize_fn AND cache on turn._tokens.
|
||
- No new helper; the change is internal to estimate_tokens.
|
||
- Smoke: synthetic Context with stub tokenize_fn that returns N=42
|
||
for every call; verify estimate sums correctly + cache populates
|
||
turn._tokens.
|
||
|
||
3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
|
||
- Existing `while #self.turns > self.max_turns` loop extended.
|
||
**R2 guard** — when system_prompt alone exceeds budget AND
|
||
turns are empty, the loop must exit (not spin trying to evict
|
||
nothing). Correct condition:
|
||
```lua
|
||
while (#self.turns > self.max_turns
|
||
or self:estimate_tokens() > self.token_budget)
|
||
and #self.turns > 0 do
|
||
```
|
||
Lowercase `or`/`and` per Lua syntax (N1).
|
||
- Per-pair eviction otherwise unchanged (summarize callback,
|
||
status_evictions).
|
||
- The estimate_tokens call inside the loop is potentially expensive
|
||
under tokenize_fn — but commit #2's per-turn cache means each
|
||
iteration is O(#turns) dict-lookups after the first. Acceptable
|
||
for the eviction hot path.
|
||
- Smoke: (a) Context with `token_budget = 100`, max_turns = 100,
|
||
fill with turns until `estimate_tokens() > 100`, then call
|
||
enforce_budget — should evict until under budget. (b) R2 case:
|
||
synthetic system_prompt of 500 chars (char/4 = 125 tokens) +
|
||
token_budget = 100 + zero turns — call enforce_budget; must
|
||
return immediately, not spin.
|
||
|
||
4. **`repl.lua` — tokenize_fn wiring + :cost detail estimate row.**
|
||
- When `config.tokenize and config.tokenize.use_endpoint`, build
|
||
`ctx_opts.tokenize_fn = function(text)
|
||
return broker.token_count(active_cfg, text)
|
||
end`. **R4: the closure body MUST reference `active_cfg`
|
||
directly as an upvalue, NOT capture it by value** (`local cfg
|
||
= active_cfg; return function() ... cfg ... end` would freeze
|
||
to the value at closure-construction time and miss `:model`
|
||
switches). A5 verified upvalue semantics in Lua.
|
||
- `:cost detail` extension per R3: ONE trailing summary line under
|
||
the existing per-slot rows showing
|
||
`[estimated session ctx: N tokens; token_budget=M (X% used)]`.
|
||
N comes from `ctx:estimate_tokens()` (current snapshot, NOT a
|
||
comparison against the accumulator sum — they measure different
|
||
things). M is `ctx.token_budget`. X% = N/M × 100.
|
||
- Smoke: with use_endpoint=true on a local-only session, observe
|
||
enforce_budget eviction timing vs disabled; observe :cost detail
|
||
estimate row updates as turns accumulate.
|
||
|
||
5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
|
||
- Commented-out `tokenize = { use_endpoint = true }` block in
|
||
config.lua with parity to Phase 1-7 example blocks. Document
|
||
the per-endpoint network cost (one probe per session) and the
|
||
implication: token_budget actually enforces now.
|
||
- PHASE8.md status header -> **Implement**.
|
||
|
||
### Risk index per commit
|
||
|
||
| Commit | Risk | Mitigation |
|
||
|---|---|---|
|
||
| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
|
||
| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
|
||
| 2 (context) | per-turn `_tokens` cache miss on every estimate when no tokenize_fn -> existing perf preserved | Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
|
||
| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
|
||
| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES — sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
|
||
| 4 (repl) | tokenize_fn closure binding to `active_cfg` upval — if upval somehow gets reassigned wrong, eviction uses wrong tokenizer | Lua upvalues are call-time-resolved; A5 verified. Test by smoke after `:model` switch. |
|
||
| 5 (config + status) | none | |
|
||
|
||
### Tests + smoke per commit
|
||
|
||
Each commit:
|
||
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
|
||
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
|
||
- Pass a per-feature smoke (described in each row above)
|
||
|
||
### Things deliberately NOT split
|
||
|
||
- New module file for tokenize — small enough to live in broker.lua.
|
||
- Per-text token cache (in addition to per-turn): not needed; turn
|
||
content is immutable post-append.
|
||
- :tokenize meta for introspecting the cache — `M.tokenize_supported`
|
||
is exported for testing; if a user needs runtime visibility, that's
|
||
a follow-up.
|
||
|
||
### Open at plan-time (resolve at implement)
|
||
|
||
- :cost detail layout — how exactly to show "estimated session ctx"
|
||
relative to the existing per-slot rows. Pick at commit 4 (likely
|
||
a single trailing line under the detail table).
|
||
- Whether to expose `:tokenize <text>` for direct-probe debugging.
|
||
Nice-to-have; defer unless useful during verify.
|