# aish — Phase 7 Manifest

**Project:** aish — AI-augmented conversational shell

**Document:** Phase 7 Requirements, Architecture & Design Decisions

**Status:** Analyze (formulate complete; tree at `3bad07b` probed)

**Date:** 2026-05-16

**Analyze findings (2026-05-16):**

A1. **`broker.chat_stream` surface is clean for the extension.** The existing `on_event(data)` closure inside `M.chat_stream` already parses `doc.error` / `doc.choices` / `delta` / tool_calls, so adding `if doc.usage then final_usage = ... end` is one block. Emission happens via a closure-local `final_usage` that the post-loop code in `chat_stream` reads and passes to `on_delta("usage", final_usage)`. To set the request flag, `build_request` needs a minor extension OR (cleaner) `chat_stream` inserts `stream_options.include_usage = true` into the body table before `json.encode`, but we currently encode in `build_request`. Cleanest: extend `build_request(model_cfg, messages, stream, opts)` so it can read `opts.include_usage`. Phase 7 simplifies the signature in passing.

A2. **7 caller sites** identified for `opts.category` threading:

| Site | Category |
|---|---|
| `safety.lua:191` (LLM probe) | `"probe"` |
| `safety.lua:354` (norris main) | `"norris"` |
| `repl.lua:326` (summarize-on-evict) | `"summarize"` |
| `repl.lua:685` (call_broker wrapper, used by ask_ai) | `"main"` |
| `repl.lua:1104` (DELEGATE: handler) | `"delegate"` |
| `repl.lua:1587` (:memory summarize) | `"memory_summarize"` |
| `repl.lua:2156` (:delegate meta) | `"delegate"` |

All callers pass `opts` already; adding a `category` field is additive and backward-compatible (default to `"main"` when absent).

A3. **`build_request` signature simplification.** Today it takes `(model_cfg, messages, stream, tools, max_tokens)` — five positional args. With Phase 7 needing `include_usage` AND `stream_options`, positional growth gets unwieldy. **Resolution:** widen to `(model_cfg, messages, stream, opts)` where opts carries `{tools, max_tokens, include_usage, stream_options}`.
Callers in `M.chat_stream` and `M.chat` pass their existing opts table through. This is a refactor but contained inside `broker.lua`.

A4. **Q-C3 RESOLVED: free-form categories.** The closed-set vs free-form debate resolved in favor of free-form per the helpers/skills convention already in place (Phase 6 `:tree` / `:diff` metas don't validate sub-args either). `:cost detail` will show whatever categories appear; in practice that is a small, documented set (six distinct categories across the 7 call sites in A2), so no surprises.

A5. **Q-C5 RESOLVED: warn fires on the call that crossed.** The crossed call's usage IS in the accumulator at the moment we check (we check AFTER `add_usage`). Firing on the NEXT call would mean a delay of one full broker round-trip before the user sees the warn — defeats the purpose. Just emit-on-cross.

A6. **Q-C6 RESOLVED: `:reset` does NOT clear `cost_warn_fired`.** Parity with `usage_totals` itself (per the §2 decision row); the user reset their conversation, not their cost meter. The flag AND the totals are reset only by the explicit `:cost reset` verb.

A7. **Norris call-graph rewires (existing `safety.lua:354` path):** with issue #52 wired (commit `955bd82`), the Norris broker call now passes `helpers.scrub_msgs` / `helpers.streaming_rehydrator`. The on_delta wrapping pattern means I need to be careful that the new `("usage", payload)` kind also flows through any wrapper. Since the secrets `streaming_rehydrator` only matches on `kind == "text"`, the `"usage"` kind passes through unchanged. No new entanglement.

A8. **`ctx.usage_totals` survives `:reset` per R8** — same invariant as `memory_items` (Phase 4) and `project` (Phase 6). Documented in §5 of the manifest; reinforces the "ambient context survives conversation reset" rule.

A9. **Session JSONL serialization** — assistant turn dict gets an optional `usage` field. `history.lua` log_turn currently calls `json.encode(turn)` opaquely; the dkjson serializer handles nested tables.
No code change needed; the new field flows through automatically when the assistant turn carries one.

A10. **Q-C1 PARTIAL: local providers may not emit `usage`.** The formulate-time assumption was "treat absence as zero-cost / unknown". A real probe against `qwen-coder-7b-snappy-8k` is a baseline action — see B-probes below. The implementation will be defensive: if `doc.usage` never appears in the stream, no "usage" event is emitted, and the accumulator is unchanged for that turn. `:cost` output naturally reflects "0 calls counted for local model" if that's the case.

A11. **Q-C4 deferred to baseline**: actual `stream_options` forwarding by the hossenfelder proxy must be probed against a live broker. If the proxy strips the option, we get no `usage` events even for cloud calls. Baseline action.

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest specifies what Phase 7 adds — **cost / usage observability**: the ability to know, mid-session, how many tokens you've spent and how much money the paid-cloud calls have cost. PHASE0 §11 originally listed phases only through 6; this commit amends §11 to add Phase 7.

---

## 1. Scope of Phase 7

Four pillars:

1. **Usage capture in broker** — `broker.chat_stream` extracts the provider's `usage` block (and `cost` where present) from the response stream. Surfaces it to the caller via a new `on_delta("usage", ...)` kind. The existing `broker.chat` buffering wrapper exposes it as a second return value `(text, usage)`. Backward-compatible: callers that don't handle the new kind / second value simply ignore it.
2. **Per-session accumulator on `ctx`** — running totals per-model AND per-call-category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on `ctx.usage_totals`. No persistence across sessions in v1 (Q-C2 defers cross-session); the session-log JSONL files DO carry per-turn usage so historical analysis is possible after the fact.
3. **`:cost` meta** — a `:cost` reporter that shows the current session totals, with optional `:cost detail` for the per-model + per-category breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).
4. **Optional warning thresholds** — `cfg.cost.warn_at_dollars` and `cfg.cost.warn_at_tokens` emit a status the first time the running total crosses the configured threshold. Default off (no warnings without config). Useful when cloud presets are configured and you want a "you've spent $1 this session" nudge before runaway cost.

**Phase 7 is done when:**

- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)` callback kind; `broker.chat` returns `(text, usage)`. Backward compat preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a total like:
  ```
  [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
         cost=$0.0234 (cloud only; local: 0)
  ```
- `:cost detail` breaks down by model + category:
  ```
  fast  main:     14 calls, 8200/2100 tokens
  cloud main:     8 calls, 3850/980 tokens, $0.0180
  cloud delegate: 1 call, 250/80 tokens, $0.0012
  cloud probe:    1 call, 150/30 tokens, $0.0042
  ```
- Session JSONL gains a `usage` field on assistant turns (when the broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6 (Phase 6 regression coverage).

---

## 2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, checking each SSE event for a `usage` field (expected on the final chunk) | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified (baseline). |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; baseline probe confirms current broker tolerates it. (A3: `build_request` signature widens to take an `opts` table; positional growth was getting unwieldy.) |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One-tag-per-call-site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field — record 0. | Single field name across all observed providers (per baseline B7 — to be confirmed). |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits — plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on ctx; reset only on session end / `:cost reset`. | One-shot to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`) — the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |

---

## 3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost` and `:cost detail` / `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe-cost in the breakdown — useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes turn opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless filter desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | **Amendment**: add Phase 7 row to §11. |

No new module files.

---

## 4. Pillar 1 — Usage capture in broker

### SSE shape (provider-by-provider — confirm in baseline)

For OpenAI-compatible streams with `stream_options: { include_usage: true }`:

```json
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```

The final usage event arrives AFTER `finish_reason` but BEFORE `[DONE]`. `choices` is empty `[]` on the usage event.

For non-streaming `chat`: usage is in the response body at the top level. `broker.chat` is a wrapper around `chat_stream`, so it inherits the on_delta path.

For local llama.cpp via hossenfelder: usage may or may not be present depending on the proxy's version. Treat absence as zero-cost / unknown.

### Extraction algorithm

```lua
local final_usage = nil

local function on_event(data)
  -- ... (existing parsing of `data` into `doc` elided) ...
  if doc.usage then
    -- Provider sent usage; capture for emission after the stream.
    final_usage = {
      prompt_tokens     = doc.usage.prompt_tokens or 0,
      completion_tokens = doc.usage.completion_tokens or 0,
      total_tokens      = doc.usage.total_tokens or 0,
      cost              = doc.usage.cost,   -- nil for local
      model             = doc.model or model_cfg.model,
      -- Echo the caller's tag (see "Category tagging" below).
      category          = (opts and opts.category) or "main",
    }
    -- Don't emit yet — the [DONE] event marks stream end; emit
    -- once we exit the curl.post_sse loop so the caller sees
    -- usage as the LAST event in the stream order.
  end
  -- ... existing text + tool_call handling ...
end

-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
```

### Outbound include_usage

```lua
local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
  body_table.stream_options = { include_usage = true }
end
```

Risk: some providers reject unrecognized fields. Baseline check; if any host throws on `stream_options`, the per-model opt-out is one line.

### Category tagging

`opts.category` is a string set by the caller. broker echoes it into the emitted usage payload so the accumulator knows what to credit. Default category if absent: `"main"`.

---

## 5. Pillar 2 — Accumulator on ctx

### Shape

```lua
ctx.usage_totals = {
  -- [model_name] = { [category] = { prompt = N, completion = N,
  --                                 calls = N, cost = N } }
  fast = {
    main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 },
  },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
ctx.cost_warn_fired = false
```

### add_usage

```lua
function Context:add_usage(model, category, u)
  model = model or "?"
  category = category or "main"
  self.usage_totals = self.usage_totals or {}
  local m = self.usage_totals[model] or {}
  local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
  c.prompt     = c.prompt + (u.prompt_tokens or 0)
  c.completion = c.completion + (u.completion_tokens or 0)
  c.calls      = c.calls + 1
  c.cost       = c.cost + (u.cost or 0)
  m[category] = c
  self.usage_totals[model] = m
end

function Context:total_cost()
  local total = 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do total = total + c.cost end
  end
  return total
end

function Context:total_tokens()
  local p, comp = 0, 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do
      p = p + c.prompt
      comp = comp + c.completion
    end
  end
  return p, comp
end
```

### Reset semantics

`Context:reset()` deliberately does NOT clear `usage_totals` — matches R8 invariant from Phase 6 (`:reset` clears `turns`, `pending_exec_output`, `summary`; preserves `memory_items`, `project`, and now `usage_totals`). The user reset their conversation, not their cost meter. `:cost reset` is the explicit reset verb for the meter.

---

## 6. Pillar 3 — `:cost` meta

```
:cost           summary line
:cost detail    per-model + per-category breakdown
:cost reset     zero out ctx.usage_totals + cost_warn_fired
```

Summary format:

```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
       cost=$0.0234 (cloud only; local: 0)
```

Detail format (sorted by total cost desc, then by model):

```
[aish] session usage detail:
  cloud  main        8 calls,  3,850 /   980 tokens, $0.0180
  cloud  probe       1 call,     150 /    30 tokens, $0.0042
  cloud  delegate    1 call,     250 /    80 tokens, $0.0012
  fast   main       14 calls,  8,200 / 2,100 tokens, $0 (local)
```

Implementation: pure Lua iteration over `ctx.usage_totals`; no broker calls. Sorting uses `table.sort` on a flattened list.

---

## 7. Pillar 4 — Warning thresholds

Config:

```lua
cost = {
  warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
  warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}
```

After every `ctx:add_usage`, check:

```lua
if config.cost and not ctx.cost_warn_fired then
  local cost = ctx:total_cost()
  if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
    renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, config.cost.warn_at_dollars))
    ctx.cost_warn_fired = true
  end
  -- (similar for warn_at_tokens; share the flag or use two)
end
```

One-shot per session. `:cost reset` clears the flag.

---

## 8. UX Surface Summary

| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

---

## 9. Out of Scope (Phase 7)

- **Cross-session cost persistence** — Q-C2 defers `/cost.jsonl` rollup; v1 is session-only. Per-turn usage IS in the session JSONL for after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call** — v1 only warns. A future phase could add a hard cap that aborts before the broker call.
- **Pricing-table fallback for local models** — if a local model doesn't emit `usage.cost`, we record 0. Estimating cost from token count + a static pricing table is a future polish (most users won't care about local "cost" anyway — local is free).
- **Pretty token-bandwidth charts / sparklines** — out of scope; the detail breakdown is text-only.
- **Estimated cost for future turns** — no preflight cost prediction.
- **MCP tool-call usage** — MCP servers don't expose token usage; broker calls invoked DURING MCP tool dispatch ARE captured (because they go through the same path), but the MCP tool call itself isn't.

---

## 10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> displayed `$0` could mislead user "is this REALLY free?" | `:cost detail` annotates local lines with `(local)` literal; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by `#models in config` × `#categories` — small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |

---

## 11. Open Questions (Phase 7)

| # | Question | Resolution |
|---|---|---|
| Q-C1 | Provider-without-usage handling | A10 — defensive silent skip; baseline probe will confirm shape on local llama.cpp. |
| Q-C2 | Cross-session cost persistence (`cost.jsonl`) | Deferred to a follow-up (Phase 8); v1 is session-only. |
| Q-C3 | Categories closed-set vs free-form | A4 — **free-form**; caller decides. Matches Phase 6 helpers/skills convention. |
| Q-C4 | `stream_options` forwarding by hossenfelder | **Baseline** — probe required against the live broker. |
| Q-C5 | Warn fires on the crossed call or the next | A5 — **on the crossed call** (no UX-defeating delay). |
| Q-C6 | `:reset` clears `cost_warn_fired` | A6 — **no**, only `:cost reset` clears the flag (R8 parity). |

---

## 12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call** — an extension of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4 heuristic on `Context:estimate_tokens()` with model `/tokenize` calls. Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained — no upstream dependencies.
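
### Appendix sketch: `:cost detail` rendering

§6 describes the `:cost detail` implementation only in words (flatten `ctx.usage_totals`, `table.sort`, print). The following is a minimal sketch of that step, not the actual `repl.lua` code. The helper names `commas`, `detail_rows`, and `format_row` are hypothetical, the category tie-break is an added assumption to make the order deterministic, and treating a zero-cost row as `(local)` is a simplification of whatever rule the real reporter uses.

```lua
-- Hypothetical helpers for illustration; none of these names come from repl.lua.

-- Insert thousands separators: 12450 -> "12,450".
local function commas(n)
  local s = tostring(math.floor(n))
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))  -- drop the comma left over on exact multiples of 3 digits
end

-- Flatten usage_totals[model][category] into a list of rows, sorted by
-- cost descending, then model name, then category (assumed tie-break).
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals or {}) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = {
        model = model, category = category,
        prompt = c.prompt, completion = c.completion,
        calls = c.calls, cost = c.cost,
      }
    end
  end
  table.sort(rows, function(a, b)
    if a.cost ~= b.cost then return a.cost > b.cost end
    if a.model ~= b.model then return a.model < b.model end
    return a.category < b.category
  end)
  return rows
end

-- Render one row. Assumption: cost == 0 is shown as "$0 (local)".
local function format_row(r)
  local noun = (r.calls == 1) and "call" or "calls"
  local cost = (r.cost > 0) and ("$%.4f"):format(r.cost) or "$0 (local)"
  return ("%-6s %-9s %d %s, %s / %s tokens, %s"):format(
    r.model, r.category, r.calls, noun,
    commas(r.prompt), commas(r.completion), cost)
end

-- Usage with the example totals from §5.
local totals = {
  fast  = { main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
for _, r in ipairs(detail_rows(totals)) do
  print(format_row(r))
end
```

With these totals the rows come out as cloud main, cloud probe, cloud delegate, fast main, which is the cost-descending order §6 asks for. Sorting a flattened copy leaves `ctx.usage_totals` untouched, so the reporter stays a pure read.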