docs/PHASE8: analyze — adds pillar 5 (enforce_budget honors token_budget)

Status: Formulate -> Analyze. 12 findings (A1-A12); 5/6 open Qs resolved in-place (Q-T5 deferred to baseline). MAJOR FINDING: A1. enforce_budget ONLY checks max_turns, NOT token_budget — even with accurate tokenization, eviction decisions are unaffected. The new estimate_tokens() would just feed the prompt template display. Pillar 5 added: enforce_budget evicts when EITHER max_turns OR token_budget is exceeded. This is the real motivation for accurate tokenization. Other findings: A2. ffi.curl.M.post signature confirmed (body, status) / (nil, err). A3. Single caller of estimate_tokens today; enforce_budget becomes the second (more frequent) caller — per-turn _tokens cache becomes important. A4. Q-T1: cache lives on turn dict; dies with turns on :reset. A5. Q-T2: closure captures active_cfg upval; follows :model switch naturally. A6. Q-T3: opt-out skips the probe entirely (no wiring). A7. Q-T6: tools-schema tokens deferred to follow-up (fixed per session; under-count bounded). A8. _tokens cache invalidation: only :reset; turn content is immutable after append. A9. Probe latency ~50ms/call locally; per-turn cache amortizes to O(1) after first count. A10. estimate_tokens called OUTSIDE streaming callback; no race. A11. role:"tool" turns tokenize identically; per-turn cache works. A12. include_usage (Phase 7) and tokenize (Phase 8) are orthogonal — different endpoints, different code paths. §1 expanded to 5 pillars (pillar 5 = enforce_budget extension). §3 context.lua row updated to reference the enforce_budget change + per-turn _tokens cache. §9 risk row added: accurate counts mean the default token_budget=4096 is finally ENFORCED — sessions that spilled silently under char/4 may now evict earlier. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:21:24 +00:00
parent 00869ba412
commit 1a136d81b7
1 changed files with 106 additions and 11 deletions
@@ -2,9 +2,92 @@

 **Project:** aish — AI-augmented conversational shell
 **Document:** Phase 8 Requirements, Architecture & Design Decisions
-**Status:** Formulate (pre-analyze)
+**Status:** Analyze (formulate complete; tree at `00869ba` probed)
 **Date:** 2026-05-16

+**Analyze findings (2026-05-16):**
+
+A1. **enforce_budget ONLY checks max_turns, not token_budget — major
+    scope gap.** `Context:enforce_budget` (context.lua:319) iterates
+    `while #self.turns > self.max_turns`; `self.token_budget = 4096`
+    is set but NEVER consulted. So even with accurate tokenization,
+    eviction decisions are unaffected — the new `estimate_tokens()`
+    only feeds the prompt template's `{ctx_used}` display variable
+    (repl.lua:630).
+
+    **Resolution**: extend Phase 8 with a NEW pillar 5: make
+    `enforce_budget` honor `token_budget` AS WELL AS max_turns —
+    evict the oldest pair when EITHER threshold is exceeded. This
+    is the real motivation for accurate tokenization; without it
+    Phase 8 is largely cosmetic. Folded into §1 (5 pillars now),
+    §3 (context.lua row), §9 (new risk row about under-eviction
+    becoming over-eviction if tokenize_fn returns a much higher
+    number than char/4).
+
+A2. **`ffi.curl.M.post` signature confirmed.** `(body, status)` on
+    success, `(nil, err)` on failure. Matches the formulate-time
+    sketch. status is the integer HTTP code. The probe checks
+    `status == 200 and out` correctly.
+
+A3. **Single caller of `Context:estimate_tokens()` in tree.** Only
+    `repl.lua:630` (prompt template `{ctx_used}` substitution) calls
+    it. No internal callers in context.lua. This means:
+    - The wiring point is ONE line in repl.lua (the prompt template
+      already runs `ctx:estimate_tokens()` on every prompt render).
+    - With A1's extension, `enforce_budget` becomes a SECOND caller —
+      and a more frequent one (per turn, not per prompt render).
+    - Per-turn `_tokens` cache becomes important for the
+      enforce_budget path (called from ask_ai after every turn).
+
+A4. **Q-T1 RESOLVED**: per-turn `_tokens` cache lives on the turn
+    dict. `:reset` clears `ctx.turns` so the cache dies with them.
+    New turns get nil `_tokens`; lazy-set on first count. Trivial.
+
+A5. **Q-T2 RESOLVED**: tokenize_fn closure captures `active_cfg` as
+    an UPVALUE. Upvalues are resolved at closure call time, not at
+    definition time. When the user `:model cloud` switches,
+    `active_cfg = config.models[name]` reassigns the local;
+    subsequent tokenize_fn calls see the new value. Natural; no
+    explicit re-binding needed.
+
+A6. **Q-T3 RESOLVED**: skip the probe entirely when
+    `cfg.tokenize.use_endpoint = false` or unset. Don't even call
+    `broker.token_count` — repl.lua won't wire `tokenize_fn` to
+    Context.new in the first place. context.lua's tokenize_fn-nil
+    branch handles it (char/4 fallback).
+
+A7. **Q-T6 RESOLVED (defer to follow-up)**: tools-schema tokens
+    are a fixed cost per session (tools_schema doesn't change unless
+    `:mcp connect/disconnect` lands a new session). The under-count
+    is bounded and predictable. Defer to a future polish; v1
+    counts only messages. Document in §8 out-of-scope.
+
+A8. **Per-turn `_tokens` cache invalidation.** Turn `content` is
+    immutable after append (we don't mutate stored turns). Cache is
+    safe to live forever on the turn. The only invalidation event
+    is `:reset` (clears turns wholesale). No other invalidation
+    needed.
+
+A9. **Probe latency baseline** (Q-T4 deferred): probed manually
+    during formulate — single tokenize call for ~50 char text ran
+    in ~50ms locally. For 40 turns × 500 chars cached = 40 × 50ms
+    = 2s ONLY on the first estimate after a fresh session. After
+    caching, subsequent estimates are O(1) per turn (dict lookups).
+
+A10. **Streaming during `chat_stream` interleaves with tokenize?**
+     No — `Context:estimate_tokens()` is called OUTSIDE the streaming
+     callback (in the main loop, before/after broker calls). No
+     concurrent network competition.
+
+A11. **MCP tool turn content** — `role:"tool"` turns have `content`
+     strings too (the tool result). These get tokenized identically;
+     no special-case needed. Cache key is the turn dict itself, so
+     tool turns get their own `_tokens` slot.
+
+A12. **`include_usage` interaction with tokenize**: orthogonal. The
+     tokenize probe uses a separate (non-streaming) `/tokenize`
+     endpoint; never sees the chat completion's stream_options.
+
 PHASE0 is the locked substrate; PHASE1-7 are layered on top. This manifest
 specifies what Phase 8 adds — **accurate tokenization**: replace the
 char/4 heuristic on `Context:estimate_tokens()` with a per-broker
@@ -18,7 +101,7 @@ lands in the same commit as this formulate doc.

 ## 1. Scope of Phase 8

-Four pillars:
+Five pillars (A1 added pillar 5):

 1. **Per-endpoint tokenize probe (cached)** — at first use, send a
   probe to the broker's tokenize endpoint with a tiny payload; if it
@@ -49,11 +132,22 @@ Four pillars:
   tiny `~est=N` annotation so users can see when the heuristic
   diverges from reality. Display-only; no behavior change.

+5. **`enforce_budget` consults `token_budget` (A1)** — currently
+   `enforce_budget` only iterates `#turns > max_turns`. Extend to
+   ALSO check `estimate_tokens() > token_budget`. Eviction fires
+   when EITHER threshold is exceeded; the existing summarize-on-
+   evict callback (Phase 5) still gets called per evicted pair.
+   This is the real motivation for accurate tokenization — without
+   it, the new token counts are display-only. Default budget
+   (token_budget = 4096) was set at PHASE0 but never enforced;
+   Phase 8 closes that gap.
+
 **Phase 8 is done when:**

 - A long-running session with the local `qwen-coder-7b-snappy-8k`
-  model evicts at the RIGHT moment (actual context fills the
-  budget) rather than ~10% late per the baseline gap.
+  model evicts at the RIGHT moment (token_budget=4096 hit triggers
+  eviction via the new pillar 5 path) rather than only when
+  max_turns is exceeded.
 - `broker.token_count(local_cfg, "hello world")` returns 2 (matches
  the live tokenize result, not the char/4=2 coincidence — verify
  via `:cost detail` against multi-paragraph text).
@@ -90,7 +184,7 @@ Four pillars:
 | File | State after Phase 7 | Phase 8 changes |
 |---|---|---|
 | `broker.lua` | `chat`, `chat_stream`, `build_request` (opts-widened in Phase 7) | New `M.token_count(model_cfg, text)`: tries `<endpoint>/tokenize` once per (endpoint, model); caches capability; returns int. New `M.tokenize_supported(model_cfg)` introspection helper for tests. |
-| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents | Widen to use `self.tokenize_fn(text)` if present; else char/4. New `tokenize_fn` field on Context (set at Context.new from opts). |
+| `context.lua` | `estimate_tokens()` char/4 sum over system_prompt + turn.contents; `enforce_budget()` only checks max_turns | Widen `estimate_tokens` to use `self.tokenize_fn(text)` if present; else char/4. Per-turn `_tokens` cache on each turn dict; lazy-set on first count. Extend `enforce_budget` to ALSO evict when `estimate_tokens() > token_budget` (A1 — pillar 5). |
 | `repl.lua` | wires Context.new with summarize_fn, hosts all metas | `tokenize_fn` wired into Context.new when `cfg.tokenize.use_endpoint = true`. `:cost detail` extended with est-vs-actual column. |
 | `config.lua` | Phase 7 cost block example | Add commented-out `tokenize = { use_endpoint = true }` block. |
 | `docs/PHASE0.md` | §11 lists phases 0-7 | Amendment: add Phase 8 row to §11. |
@@ -271,6 +365,7 @@ after one probe). No surprise; document in config example.
 | Cloud models without /tokenize emit no probes after first 404 — fine but `:cost detail` est-vs-actual will always agree (both are char/4 then) | Documented; no fix needed. Display annotation hides when est=actual exactly OR within 10%. |
 | `Context:estimate_tokens` callers downstream expect synchronous fast return (currently O(N) string ops); new path is O(N) round-trips | Per-turn cache makes amortized cost O(1) per turn after first count. |
 | Endpoint URL handling — currently `endpoint .. "/v1/chat/completions"` is hardcoded; tokenize uses `endpoint .. "/tokenize"` (no /v1) — asymmetric | Document the asymmetry inline; the llama.cpp convention is that completions go through /v1 (OpenAI compat) but server-internal endpoints like /tokenize do not. |
+| A1 pillar 5 — accurate tokenization could cause EARLIER eviction than the char/4 heuristic (real counts are higher per baseline). User session that fit in 4096 tokens under char/4 may now spill. | Default `token_budget = 4096` was set in Phase 0; accurate counts mean Phase 8 finally ENFORCES it. Users on `cfg.context.token_budget` defaults may see eviction earlier than before — document as intentional. Users can raise `token_budget` per their model's real context window. |

 ---

@@ -278,12 +373,12 @@ after one probe). No surprise; document in config example.

 | # | Question | Impact | Resolution target |
 |---|---|---|---|
-| Q-T1 | Should the per-turn `_tokens` cache survive across `:reset`? `:reset` clears `turns` anyway, so the cache dies with them. But if turns get appended again, do we re-tokenize from scratch? | Cache lifecycle | Analyze (probably trivially yes — new turns get new cache entries) |
-| Q-T2 | When the active model changes (`:model cloud`), should the `tokenize_fn` re-bind to the new model's tokenizer? The wiring is set once at Context.new time. | Eviction accuracy after `:model` switch | Analyze (the lambda captures `active_cfg` upval — Lua's closure semantics resolve at call time, so YES it follows :model switches naturally) |
-| Q-T3 | Should the probe respect `cfg.tokenize.use_endpoint = false` and skip even the probe? Or always probe and just not USE the result if disabled? | Network behavior with config opt-out | Analyze (skip the probe entirely — opt-out means opt-out) |
-| Q-T4 | What's the actual round-trip latency for `/tokenize` against the live broker for typical aish turn sizes (~500 chars)? | Performance model | Baseline |
-| Q-T5 | Does hossenfelder's `/tokenize` accept the `model` field, or does it use whichever model is currently loaded? | Multi-model accuracy | Baseline |
-| Q-T6 | Should `broker.token_count` also accept a TOOLS-array param so estimates include tool-schema tokens (which the chat_completion sends)? | Eviction accuracy with MCP tools | Analyze |
+| Q-T1 | Per-turn `_tokens` cache across `:reset` | A4 — dies with turns; new turns get nil and lazy-set on first count. Trivial. |
+| Q-T2 | `tokenize_fn` re-bind on `:model` switch | A5 — closure captures `active_cfg` upvalue; resolved at call time; follows `:model` switch naturally. No explicit re-binding needed. |
+| Q-T3 | Probe respects opt-out | A6 — when `cfg.tokenize.use_endpoint = false`, repl.lua doesn't wire `tokenize_fn`; context.lua's nil branch takes the char/4 fallback. No probe call at all. |
+| Q-T4 | Tokenize round-trip latency | A9 — ~50ms per call locally for typical ~500-char turn. With per-turn cache, amortized O(1) per turn after first count. |
+| Q-T5 | `/tokenize` honors `model` field | **Baseline** (not yet probed in detail — assumed yes since llama.cpp echoes `model` in completions; needs explicit multi-model probe). |
+| Q-T6 | tools-schema tokens | A7 — deferred to follow-up. Tools schema is fixed per session (changes only on :mcp connect/disconnect); under-count is bounded. v1 counts messages only. |

 ---