From aa64ad3eecff312ab2b28b5d58a806afdf6170ae Mon Sep 17 00:00:00 2001
From: Markus Fritsche <mfritsche@reauktion.de>
Date: Sat, 16 May 2026 23:24:41 +0000
Subject: [PATCH] =?UTF-8?q?docs/PHASE8:=20plan=20=E2=80=94=20=C2=A713=20co?=
 =?UTF-8?q?mmit=20roadmap=20(5=20commits)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Status: Analyze -> Plan. All open Qs resolved (Q-T5 via baseline B1).

5-commit roadmap, bottom-up:

  1. broker.lua — M.token_count helper + per-endpoint capability
     cache. <endpoint>/tokenize probe with 2s timeout; cache true/false
     per (endpoint, model) for the session. char/4 fallback on any
     non-200 / parse-fail / transport err. M.tokenize_supported
     introspection helper.

  2. context.lua — Context.new accepts opts.tokenize_fn; estimate_
     tokens widens to use it when set, with per-turn `_tokens` cache.
     char/4 path unchanged when tokenize_fn nil.

  3. context.lua — enforce_budget consults token_budget too (pillar
     5 from A1). Loop condition: turns>max_turns OR estimate_tokens
     >token_budget. Existing summarize-on-evict callback unchanged.

  4. repl.lua — wire tokenize_fn when cfg.tokenize.use_endpoint=true.
     Closure captures active_cfg upval (A5 — follows :model switches
     naturally). :cost detail extension: trailing line showing
     estimated session ctx tokens for comparison with the per-slot
     prompt_tokens sums in the accumulator.

  5. config.lua commented `tokenize = { use_endpoint = true }`
     example + PHASE8.md status -> Implement.

Per-commit risk index covers: probe latency cap (2s, one-shot),
per-turn cache correctness (immutable post-append), enforce_budget
performance (O(N) per call after cache fill), and the intentional
behavior change of token_budget actually being enforced (sessions
fitting under char/4 may evict earlier under accurate counts —
documented in §9).

Two items open at plan, resolve at implement:
  - exact :cost detail layout for estimated session ctx row
  - whether to add a :tokenize debug meta (defer unless useful in verify)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/PHASE8.md | 119 ++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 118 insertions(+), 1 deletion(-)
diff --git a/docs/PHASE8.md b/docs/PHASE8.md
index 59c85d4..4978935 100644
--- a/docs/PHASE8.md
+++ b/docs/PHASE8.md
@@ -2,7 +2,7 @@
 
 **Project:** aish — AI-augmented conversational shell
 **Document:** Phase 8 Requirements, Architecture & Design Decisions
-**Status:** Analyze (formulate complete; tree at `00869ba` probed)
+**Status:** Plan (formulate + analyze + baseline complete; tree at `79bd40d`)
 **Date:** 2026-05-16
 
 **Analyze findings (2026-05-16):**
@@ -396,3 +396,120 @@ Candidate follow-ups (non-binding):
   independent of any current line.
 
 Phase 8 itself is self-contained — no upstream dependencies.
+
+---
+
+## 13. Implementation Plan (commit-by-commit)
+
+Bottom-up: broker first (the egress capability all callers depend
+on), then context (the consumer + the new pillar 5 budget extension),
+then repl.lua wiring + display, then config + status bump. Each
+commit leaves the tree green (existing tests + load smoke + per-
+commit feature smoke).
+
+### Order
+
+1. **`broker.lua` — `M.token_count` helper + per-endpoint capability cache.**
+   - Module-local `_tokenize_capable` table keyed by `endpoint .. "/" .. model`.
+   - `M.token_count(model_cfg, text)`:
+     - empty text -> 0
+     - bad cfg (no endpoint) -> char/4 immediately
+     - capability cache says `false` for this slot -> char/4
+     - otherwise: probe `<endpoint>/tokenize` with `{content, model}` body,
+       2s timeout. On `status == 200 + parseable {tokens=[...]}`:
+       cache `true`, return `#tokens`. Anything else (non-200, parse
+       fail, transport err): cache `false`, char/4.
+   - `M.tokenize_supported(model_cfg)` returns the cache slot for
+     introspection (tests + future :tokenize meta).
+   - Smoke: hand-call `M.token_count(local_cfg, "hello world")` -> 2;
+     `M.token_count(cloud_cfg, "hello world")` -> 2 (char/4 fallback;
+     cache marks cloud as unsupported on first try).
+
+2. **`context.lua` — estimate_tokens widening + per-turn cache.**
+   - Context.new accepts `opts.tokenize_fn` -> stored as `self.tokenize_fn`.
+   - `Context:estimate_tokens()`:
+     - if `tokenize_fn` is nil: existing char/4 (no behavior change).
+     - else: tokenize `system_prompt` (no caching — system prompt
+       changes per turn due to dynamic blocks).
+       For each turn: if `turn._tokens` is set use it; else
+       compute via tokenize_fn AND cache on turn._tokens.
+   - No new helper; the change is internal to estimate_tokens.
+   - Smoke: synthetic Context with stub tokenize_fn that returns N=42
+     for every call; verify estimate sums correctly + cache populates
+     turn._tokens.
+
+3. **`context.lua` — enforce_budget honors token_budget (pillar 5).**
+   - Existing `while #self.turns > self.max_turns` loop extended to:
+     `while #self.turns > self.max_turns OR self:estimate_tokens() > self.token_budget do`.
+   - Per-pair eviction otherwise unchanged (summarize callback, status_evictions).
+   - The estimate_tokens call inside the loop is potentially expensive
+     under tokenize_fn — but commit #2's per-turn cache means each
+     iteration is O(#turns) dict-lookups. Acceptable for the eviction
+     hot path.
+   - Smoke: Context with `token_budget = 100`, max_turns = 100, fill
+     with turns until `estimate_tokens() > 100`, then call enforce_budget
+     — should evict until under budget. With char/4: eviction happens
+     at known char counts. With tokenize_fn stub returning 50/turn:
+     evicts down to <= 100/50 = 2 turns.
+
+4. **`repl.lua` — tokenize_fn wiring + :cost detail est-vs-actual.**
+   - When `config.tokenize and config.tokenize.use_endpoint`, build
+     `ctx_opts.tokenize_fn = function(text)
+        return broker.token_count(active_cfg, text)
+      end`. The closure captures `active_cfg` upval per A5 — follows
+     `:model` switches naturally.
+   - `:cost detail` extension: per slot, compute the sum of actual
+     `prompt_tokens` (already in usage_totals) AND an "estimated"
+     count via... actually this needs a rethink: the accumulator
+     doesn't store per-turn data, just running sums. The estimate
+     for comparison would need to be over ctx's CURRENT content
+     (a single snapshot, not per-call). Simplest: show a SECOND
+     line under `:cost detail` with the current `estimate_tokens()`
+     value, labeled "[estimated session ctx: N tokens]". Disagreement
+     with the accumulator's prompt_tokens TOTAL surfaced inline.
+   - Smoke: with use_endpoint=true on a local-only session, observe
+     enforce_budget eviction timing vs disable; observe :cost detail
+     estimate row.
+
+5. **`config.lua` example block + `docs/PHASE8.md` status bump.**
+   - Commented-out `tokenize = { use_endpoint = true }` block in
+     config.lua with parity to Phase 1-7 example blocks. Document
+     the per-endpoint network cost (one probe per session) and the
+     implication: token_budget actually enforces now.
+   - PHASE8.md status header -> **Implement**.
+
+### Risk index per commit
+
+| Commit | Risk | Mitigation |
+|---|---|---|
+| 1 (broker) | Per-endpoint cache leaks across model_cfg deletions (e.g., user removes a model from config mid-session) | Cache is keyed by string; stale entries don't grow without bound (bounded by #configured models × 1). No GC needed. |
+| 1 (broker) | /tokenize probe blocks the calling thread for 2s on a misconfigured endpoint | 2s timeout is the cap; one-shot per endpoint per session. |
+| 2 (context) | per-turn `_tokens` cache miss on every estimate when no tokenize_fn -> existing perf preserved | Cache check is conditional on tokenize_fn presence; char/4 path untouched. |
+| 3 (context) | enforce_budget loop now calls estimate_tokens potentially every iteration; with tokenize_fn that's O(#turns) per iteration -> O(#turns^2) worst case | Per-turn cache makes this O(#turns) amortized after first fill. For typical max_turns=40 + token_budget=4096 sessions: ~40^2 dict lookups = 1600 ops in worst case, microsecond cost. |
+| 3 (context) | accurate counts mean token_budget=4096 (Phase 0 default) finally ENFORCES — sessions that fit under char/4 may now evict earlier | Documented in §9; user can raise token_budget to match their model's real context window. |
+| 4 (repl) | tokenize_fn closure binding to `active_cfg` upval — if upval somehow gets reassigned wrong, eviction uses wrong tokenizer | Lua upvalues are call-time-resolved; A5 verified. Test by smoke after `:model` switch. |
+| 5 (config + status) | none | |
+
+### Tests + smoke per commit
+
+Each commit:
+- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
+- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
+- Pass a per-feature smoke (described in each row above)
+
+### Things deliberately NOT split
+
+- New module file for tokenize — small enough to live in broker.lua.
+- Per-text token cache (in addition to per-turn): not needed; turn
+  content is immutable post-append.
+- :tokenize meta for introspecting the cache — `M.tokenize_supported`
+  is exported for testing; if a user needs runtime visibility, that's
+  a follow-up.
+
+### Open at plan-time (resolve at implement)
+
+- :cost detail layout — how exactly to show "estimated session ctx"
+  relative to the existing per-slot rows. Pick at commit 4 (likely
+  a single trailing line under the detail table).
+- Whether to expose `:tokenize <text>` for direct-probe debugging.
+  Nice-to-have; defer unless useful during verify.