aish/docs/PHASE7.md
marfrit 1f34b6dce8 config + docs/PHASE7: example block + status -> Implement (Phase 7 commit #6)
R9-resolved single-owner of the status bump (commit #5 didn't touch
PHASE7.md). N5: PHASE0 §11 amendment landed in commit 3bad07b
(formulate); not re-applied here.

config.lua:
  - Commented-out `cost = { warn_at_dollars, warn_at_tokens }` block
    with parity to the Phase 1-6 example blocks.
  - Notes warn flags are independent (R4) and per-turn usage flows
    to session/*.jsonl for after-the-fact analysis.

docs/PHASE7.md:
  - Status header bumped: "Plan + review fold-in" -> "Implement"
  - Lists the 6 implement commits inline for traceability:
      7364963  broker: usage capture + opts widening
      7b4a9be  context: accumulator helpers
      8adebd5  repl: _record_usage + opts.category at 5 sites
      b30212a  safety + repl: opts.category for Norris + probe
      0d6ff93  repl: :cost meta surface
      this     config example + status bump

Phase 7 implementation is complete. Next inner-loop step is verify
(7) — user-driven smoke tests, then memory-update (8).

Regression: test_safety 87/87, test_router_model 31/31, repl loads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 23:02:55 +00:00

# aish — Phase 7 Manifest
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 7 Requirements, Architecture & Design Decisions
**Status:** Implement (6 commits landed: 7364963, 7b4a9be, 8adebd5, b30212a, 0d6ff93, this)
**Date:** 2026-05-16
**Review findings (independent Sonnet agent, 2026-05-16): 3 BLOCKERs
resolved in-place, 7 CONCERNs folded, 5 NITs applied:**
R1 (BLOCKER, RESOLVED). **`M.chat` would silently return `(text, nil)`
for ALL non-streaming callers.** `M.chat`'s internal on_delta only
captures `kind == "text"`. Without explicit handling of
`kind == "usage"`, four of the six categories, the ones that go through
`broker.chat` (summarize / delegate / memory_summarize / probe),
would report zero usage even after a cloud round-trip. **Fix
folded into §4 + §13 commit 1:** M.chat's on_delta also captures
the usage payload and returns it as the second value.
R2 (BLOCKER, RESOLVED). **`call_broker` fallback retry — usage
payload's `model` field credits the WRONG model name.** The
`wrapped` on_delta in call_broker is closed over the PRIMARY's
name; if the wrapped function uses an outer-scope `model_name`
variable to key the accumulator, the fallback's usage gets
misattributed. **Resolution:** the broker emits `payload.model =
model_cfg.model` (which IS the fallback's model when called with
`fb_cfg` — chat_stream's local upvar). The wrapper keys by
`payload.model`, NOT by the outer `model_name`. Documented in
§4 emission code + §13 commit 3 (wrapped on_delta uses
`payload.model` for accumulator keying).
R3 (BLOCKER, RESOLVED — promoted to docs). **`build_request` has
TWO internal callers inside broker.lua itself**, not just the
public surface. Migration is contained but both internal sites
must be updated in commit 1. Plan §13 commit 1 risk row updated
to call this out explicitly so the implementer doesn't read
"every caller already passes opts" as "only external callers
need touching".
R4 (CONCERN, FOLDED). **Single `cost_warn_fired` flag for two
thresholds is broken.** When both warn_at_dollars AND
warn_at_tokens are configured, the first-to-fire suppresses the
other. **Fix:** `ctx.cost_warn_fired` becomes `ctx.cost_warn_state
= { dollars = false, tokens = false }`. Each threshold has its
own flag; `:cost reset` clears both. §7 pseudocode updated.
R5 (CONCERN, FOLDED). **Warn-check centralization decided:** use a
single `_record_usage(model, category, usage)` helper inside
repl.lua that wraps `ctx:add_usage` AND does the threshold check
AND calls renderer.status when crossed. `context.lua` stays
decoupled from `renderer`. safety.lua call sites get
`helpers.on_usage = _record_usage` in the helpers table; probe
callsite gets `opts.on_usage = _record_usage`. Single chokepoint
for the warn check. §3 + §7 + §13 commits 3-5 reflect.
R6 (CONCERN, FOLDED). **`nil` vs `0` cost distinction must be
preserved at the accumulator level.** Local-model `$0` (no cost
field) vs cloud-call-that-happens-to-cost-zero need to be
distinguishable for `:cost detail` annotation. **Fix:** accumulator
slot gains `is_local = true` when ANY recorded usage for that
slot had `cost == nil`. Cloud calls with `cost = 0` (rare) stay
annotated as cloud. §5 pseudocode + §6 annotation logic updated.
R7 (CONCERN, FOLDED). **`:cost detail` sort needs three-level key
for determinism.** Lua's `table.sort` is unstable; equal-cost
rows would have arbitrary order. **Fix:** sort key is
`(cost desc, model asc, category asc)`. §6 updated.
R8 (CONCERN, FOLDED). **`call_broker` fallback passes `opts.include_usage`
unchanged.** Documented as a known assumption (B1 confirms both
backends accept; if a future fallback host rejects, the call-site
can pass `include_usage = false` explicitly). §10 risk row added.
R9 (CONCERN, FOLDED). **`:resume` does NOT restore historical
`usage_totals`.** Per-turn usage IS in the session JSONL but
`:resume` reloads turns for conversation continuity only; the
accumulator stays empty. Documented in §8 surface notes; users
who want cross-session totals can script the jsonl or wait for
the deferred Q-C2 follow-up.
R10 (CONCERN, FOLDED). **`$%.4f` loses sub-cent precision.** A
`0.000028` cloud cost displays as `$0.0000` — indistinguishable
from `$0` local. **Fix:** format strings widened to `$%.6f` in
§6 (and the warn message in §7). 6 decimal places accommodates
the smallest observed real cost.
R-N1..N5 (NITs, APPLIED):
N1. §4 extraction pseudocode gains a comment noting the
`if doc.usage` branch is INDEPENDENT of the choice branch and
must be checked regardless of choice nil-ness (handles both
B2 emission shapes).
N2. §2 "Cost extraction" row referenced stale "B7"; corrected to B3.
N3. §13 commit 3 row gains an explicit dependency note: commit 3's
"capture the new second return value" requires commit 1's M.chat
fix from R1 to ship first.
N4. §3 safety.lua row + §13 commit 4 row spell out the signature
chain: `llm_probe` -> `llm_second_opinion` -> `M.is_destructive`
all widen to thread `opts.on_usage` through.
N5. §3 PHASE0.md row + §13 commit 6 row — the PHASE0 §11 amendment
is ALREADY in tree (committed at `3bad07b` with the formulate
doc). Commit 6 should NOT re-apply; only adds config.lua block
+ bumps PHASE7 status header.
**Analyze findings (2026-05-16):**
A1. **broker.chat_stream surface is clean for the extension.** The
existing `on_event(data)` closure inside `M.chat_stream` already
parses `doc.error` / `doc.choices` / `delta` / tool_calls — adding
`if doc.usage then final_usage = ... end` is one block. Emission
happens via a closure-local `final_usage` that the post-loop code
in `chat_stream` reads and calls `on_delta("usage", final_usage)`
on. `build_request` needs a minor extension OR `chat_stream` would
have to insert `stream_options.include_usage = true` into the body
table before it is JSON-encoded, but encoding currently happens
inside `build_request`. Cleanest: extend
`build_request(model_cfg, messages, stream, opts)` so it can read
`opts.include_usage`. Phase 7 simplifies the signature in passing.
A2. **7 caller sites** identified for `opts.category` threading:
| Site | Category |
|---|---|
| `safety.lua:191` (LLM probe) | `"probe"` |
| `safety.lua:354` (norris main) | `"norris"` |
| `repl.lua:326` (summarize-on-evict) | `"summarize"` |
| `repl.lua:685` (call_broker wrapper, used by ask_ai) | `"main"` |
| `repl.lua:1104` (DELEGATE: handler) | `"delegate"` |
| `repl.lua:1587` (:memory summarize) | `"memory_summarize"` |
| `repl.lua:2156` (:delegate meta) | `"delegate"` |
All callers pass `opts` already; adding a `category` field is
additive and backward-compatible (default to `"main"` when absent).
A3. **`build_request` signature simplification.** Today it takes
`(model_cfg, messages, stream, tools, max_tokens)` — five positional
args. With Phase 7 needing `include_usage` AND `stream_options`,
positional growth gets unwieldy. **Resolution:** widen to
`(model_cfg, messages, stream, opts)` where opts carries
`{tools, max_tokens, include_usage, stream_options}`. Callers in
`M.chat_stream` and `M.chat` pass their existing opts table through.
This is a refactor but contained inside broker.lua.
A4. **Q-C3 RESOLVED: free-form categories.** The closed-set vs free-form
debate resolved in favor of free-form per the helpers/skills
convention already in place (Phase 6 :tree / :diff metas don't
validate sub-args either). `:cost detail` will show whatever
categories appear — small + documented closed set in practice
(6 distinct categories across the 7 call sites in A2), no surprise.
A5. **Q-C5 RESOLVED: warn fires on the call that crossed.** The crossed
call's usage IS in the accumulator at the moment we check (we
check AFTER `add_usage`). Firing on the NEXT call would mean a
delay of one full broker round-trip before the user sees the
warn — defeats the purpose. Just emit-on-cross.
A6. **Q-C6 RESOLVED: `:reset` does NOT clear `cost_warn_fired`.**
Parity with `usage_totals` itself (per the §2 decision row); the
user reset their conversation, not their cost meter. The flag
AND the totals are reset only by the explicit `:cost reset` verb.
A7. **Norris call-graph rewires (existing safety.lua:354 path):** with
issue #52 wired (commit `955bd82`), the Norris broker call now
passes `helpers.scrub_msgs` / `helpers.streaming_rehydrator`. The
on_delta wrapping pattern means I need to be careful that the new
`("usage", payload)` kind also flows through any wrapper. Since
secrets streaming_rehydrator only matches on `kind == "text"`, the
"usage" kind passes through unchanged. No new entanglement.
A8. **`ctx.usage_totals` survives `:reset`**, following the Phase 6 R8
invariant that already covers `memory_items` (Phase 4) and `project`
(Phase 6). Documented in §5 of the manifest; reinforces the "ambient
context survives conversation reset" rule.
A9. **Session JSONL serialization** — assistant turn dict gets an
optional `usage` field. `history.lua` log_turn currently calls
`json.encode(turn)` opaquely; the dkjson serializer handles nested
tables. No code change needed; the new field flows through
automatically when the assistant turn carries one.
A10. **Q-C1 PARTIAL: local providers may not emit `usage`.** The
formulate-time assumption was "treat absence as zero-cost / unknown".
A real probe against `qwen-coder-7b-snappy-8k` is a baseline
action — see B-probes below. The implementation will be defensive:
if `doc.usage` never appears in the stream, no "usage" event is
emitted, and the accumulator is unchanged for that turn. `:cost`
output naturally reflects "0 calls counted for local model" if
that's the case.
A11. **Q-C4 deferred to baseline**: actual `stream_options` forwarding
by the hossenfelder proxy must be probed against a live broker.
If the proxy strips the option, we get no `usage` events even
for cloud calls. Baseline action.
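
A7 and R2 together describe a wrapper pattern: transform only `"text"` payloads, let every other kind pass through untouched, and key the accumulator by `payload.model` rather than by an outer-scope name. The sketch below illustrates that pattern under stated assumptions: `wrap_on_delta`, the `rehydrate` stand-in and the `<SECRET_1>` placeholder are hypothetical names for illustration, not the actual repl.lua or secrets code.

```lua
-- Hypothetical wrapper: only "text" is transformed; other kinds pass through.
local function wrap_on_delta(inner, rehydrate, record_usage)
  return function(kind, payload)
    if kind == "text" then
      inner(kind, rehydrate(payload))
    else
      if kind == "usage" then
        -- R2: key by payload.model, not by an outer-scope model name.
        record_usage(payload.model, payload.category, payload)
      end
      inner(kind, payload)
    end
  end
end

local seen, credited = {}, {}
local on_delta = wrap_on_delta(
  function(kind, payload) seen[#seen + 1] = { kind = kind, payload = payload } end,
  -- Stand-in rehydrator; the placeholder format is an assumption.
  function(text) return (text:gsub("<SECRET_1>", "hunter2")) end,
  function(model, category) credited[#credited + 1] = model .. "/" .. category end
)

on_delta("text", "pw is <SECRET_1>")
on_delta("usage", { model = "fallback-model", category = "main", prompt_tokens = 5 })
```

Because the usage branch reads the model from the payload, a fallback retry is credited to the fallback model without the wrapper having to know that a retry happened.
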
PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest
specifies what Phase 7 adds — **cost / usage observability**: the ability
to know, mid-session, how many tokens you've spent and how much money the
paid-cloud calls have cost.
PHASE0 §11 originally listed phases only through 6; the formulate commit
(`3bad07b`) amended §11 to add Phase 7.
---
## 1. Scope of Phase 7
Four pillars:
1. **Usage capture in broker**: `broker.chat_stream` extracts the
provider's `usage` block (and `cost` where present) from the response
stream. Surfaces it to the caller via a new `on_delta("usage", ...)`
kind. The existing `broker.chat` buffering wrapper exposes it as a
second return value `(text, usage)`. Backward-compatible: callers
that don't handle the new kind / second value simply ignore it.
2. **Per-session accumulator on `ctx`** — running totals per-model AND
per-call-category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on
`ctx.usage_totals`. No persistence across sessions in v1 (Q-C2
defers cross-session); the session-log JSONL files DO carry per-turn
usage so historical analysis is possible after the fact.
3. **`:cost` meta** — a `:cost` reporter that shows the current session
totals, with optional `:cost detail` for the per-model + per-category
breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).
4. **Optional warning thresholds**: `cfg.cost.warn_at_dollars` and
`cfg.cost.warn_at_tokens` emit a status the first time the running
total crosses the configured threshold. Default off (no warnings
without config). Useful when cloud presets are configured and you
want a "you've spent $1 this session" nudge before runaway cost.
**Phase 7 is done when:**
- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)`
callback kind; `broker.chat` returns `(text, usage)`. Backward compat
preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a
total like:
```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
cost=$0.023400 (cloud only; local: 0)
```
- `:cost detail` breaks down by model + category:
```
cloud main: 8 calls, 3850/980 tokens, $0.018000
cloud probe: 1 call, 150/30 tokens, $0.004200
cloud delegate: 1 call, 250/80 tokens, $0.001200
fast main: 14 calls, 8200/2100 tokens, $0 (local)
```
- Session JSONL gains a `usage` field on assistant turns (when the
broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative
emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6
(Phase 6 regression coverage).
---
## 2. Technology Decisions (delta from Phase 6)
| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, reading the `usage` field that arrives near the end of the SSE stream | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified in baseline. |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; baseline probe confirms current broker tolerates it. (A3: `build_request` signature widens to take an `opts` table; positional growth was getting unwieldy.) |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One-tag-per-call-site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field — record as nil (R6 — preserves the local-vs-cloud-zero distinction in the accumulator). | Single field name across observed providers per baseline B3. |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits — plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of each threshold emits a single status, e.g. `[aish] session cost $X.XXXXXX has crossed warn_at_dollars=$Y.YYYYYY` (R10: 6 decimals). One crossed-flag per threshold stored on ctx (R4); reset only on session end / `:cost reset`. | One-shot per threshold to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`) — the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |
---
## 3. Module Changes
| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_state` (`{ dollars = false, tokens = false }`, per R4). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per the Phase 6 R8 invariant). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `_record_usage(...)`, the single helper that wraps `ctx:add_usage(...)` and runs the cost-warn check (R5). New `:cost`, `:cost detail` and `:cost reset` metas. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe-cost in the breakdown — useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes turn opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless filter desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | Amendment landed at `3bad07b` (formulate commit). N5: commit 6 does NOT re-apply. |
No new module files.
---
## 4. Pillar 1 — Usage capture in broker
### SSE shape (provider-by-provider — confirm in baseline)
For OpenAI-compatible streams with `stream_options: { include_usage: true }`:
```text
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```
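In this shape the usage event arrives AFTER `finish_reason` but BEFORE
`[DONE]`, with `choices` empty (`[]`). Per B2, cloud providers may
instead attach `usage` to the non-empty chunk that carries
`finish_reason`; the extraction algorithm handles both shapes.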
For non-streaming `chat`: usage is in the response body at the top level.
broker.chat is a wrapper around chat_stream, so it inherits the on_delta
path.
For local llama.cpp via hossenfelder: usage may or may not be present
depending on the proxy's version. Treat absence as zero-cost / unknown.
### Extraction algorithm
```lua
local final_usage = nil

local function on_event(data)
  -- ...
  -- N1: this branch is INDEPENDENT of the choice branch below;
  -- check unconditionally. Per B2, local emits usage on a
  -- choices=[] chunk (choice nil); cloud emits on a non-empty
  -- choices chunk (with finish_reason). Both shapes funnel here.
  if doc.usage then
    -- R2: payload.model is ALWAYS the caller-stable model_cfg.model
    -- (chat_stream's local upvar). When called via call_broker's
    -- fallback retry, this naturally reflects the fallback's
    -- model name, so wrapper callers can key by payload.model
    -- without tracking primary-vs-fallback themselves.
    final_usage = {
      prompt_tokens     = doc.usage.prompt_tokens or 0,
      completion_tokens = doc.usage.completion_tokens or 0,
      total_tokens      = doc.usage.total_tokens or 0,
      -- R6: keep nil-vs-0 distinction at this layer; the
      -- accumulator decides how to tag local-vs-cloud-zero.
      cost     = doc.usage.cost,           -- nil for local
      model    = model_cfg.model,          -- caller-stable per B4
      category = opts.category or "main",
    }
    -- Don't emit yet: the [DONE] event marks stream end; emit
    -- once we exit the curl.post_sse loop so the caller sees
    -- usage as the LAST event in the stream order.
  end
  -- ... existing text + tool_call handling (unchanged) ...
end

-- After curl.post_sse returns (stream complete). R3-related:
-- only emit on successful streams; transport / api errors skip
-- the usage event (caller sees the error path and accumulator
-- stays unchanged).
if api_err then return nil, "api: " .. api_err end
if not ok then return nil, "transport: " .. tostring(err) end
if final_usage then on_delta("usage", final_usage) end
return true
```
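
The two B2 emission shapes can be exercised in isolation. The following is a minimal sketch, assuming the SSE chunk has already been JSON-decoded into a Lua table; `extract_usage` is a hypothetical standalone helper that mirrors the `if doc.usage` branch above, not a function in broker.lua.

```lua
-- Hypothetical helper: normalise the usage block of an already-decoded chunk.
local function extract_usage(doc, model, category)
  if type(doc) ~= "table" or type(doc.usage) ~= "table" then return nil end
  local u = doc.usage
  return {
    prompt_tokens     = u.prompt_tokens or 0,
    completion_tokens = u.completion_tokens or 0,
    total_tokens      = u.total_tokens or 0,
    cost              = u.cost,            -- stays nil when the provider omits it
    model             = model,
    category          = category or "main",
  }
end

-- Shape 1 (local, per B2): usage on a chunk whose choices list is empty.
local local_chunk = {
  choices = {},
  usage = { prompt_tokens = 15, completion_tokens = 3, total_tokens = 18 },
}
-- Shape 2 (cloud, per B2): usage on the chunk that carries finish_reason.
local cloud_chunk = {
  choices = { { index = 0, delta = {}, finish_reason = "stop" } },
  usage = { prompt_tokens = 15, completion_tokens = 3, total_tokens = 18, cost = 0.00004 },
}
-- An ordinary text chunk carries no usage at all.
local text_chunk = { choices = { { index = 0, delta = { content = "Hi" } } } }

local a = extract_usage(local_chunk, "fast", "probe")
local b = extract_usage(cloud_chunk, "cloud")
local c = extract_usage(text_chunk, "cloud")
```

Here `a.cost` is `nil` (local), `b.cost` is a number (cloud), and `c` is `nil` because a plain text chunk produces no usage payload.
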
### `M.chat` capture (R1 — BLOCKER fix)
`M.chat` is the non-streaming buffering wrapper. Its existing on_delta
only captured text. Under Phase 7 it MUST also capture the usage
payload; otherwise EVERY non-streaming caller (summarize, delegate,
memory_summarize, probe: 4 of the 6 categories) silently reports zero.
```lua
function M.chat(model_cfg, messages, opts)
  local parts = {}
  local captured_usage  -- R1: required so M.chat returns (text, usage)
  local ok, err = M.chat_stream(model_cfg, messages,
    function(kind, payload)
      if kind == "text" then
        parts[#parts + 1] = payload
      elseif kind == "usage" then
        captured_usage = payload
      end
    end, opts)
  if not ok then return nil, err end
  return table.concat(parts), captured_usage
end
```
Existing callers that do `local r = broker.chat(...)` automatically
drop the second value (Lua semantics). Callers that want usage do
`local r, u = broker.chat(...)`.
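
The backward-compatibility claim rests on Lua's multiple-return semantics and can be checked without a broker. A minimal sketch, assuming a stub in place of `M.chat_stream`; `chat_stream_stub` and the local `chat` are illustrative stand-ins, not the broker.lua functions.

```lua
-- Stub standing in for broker.chat_stream: two text deltas, then usage.
local function chat_stream_stub(_model_cfg, _messages, on_delta, _opts)
  on_delta("text", "Hel")
  on_delta("text", "lo")
  on_delta("usage", { prompt_tokens = 15, completion_tokens = 3, total_tokens = 18 })
  return true
end

-- Buffering wrapper with the same shape as the M.chat sketch above.
local function chat(model_cfg, messages, opts)
  local parts, usage = {}, nil
  local ok, err = chat_stream_stub(model_cfg, messages, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end, opts)
  if not ok then return nil, err end
  return table.concat(parts), usage
end

local text_only = chat({}, {})      -- old-style caller: second value is dropped
local text, usage = chat({}, {})    -- new-style caller: opts in to usage
```

One consequence of this shape is that the second return value is overloaded: it is an error string when the first value is `nil` and a usage table (or `nil`) otherwise, so callers should test the first value before interpreting the second.
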
### Outbound include_usage
```lua
local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
  body_table.stream_options = { include_usage = true }
end
```
Risk: some providers reject unrecognized fields. Baseline check; if any
host throws on `stream_options`, the per-model opt-out is one line.
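
A sketch of how the widened opts table might drive the request body. `build_body` is a hypothetical helper that stops short of JSON encoding (the real `build_request` also encodes), and guarding on `stream` is an assumption made here on the grounds that `stream_options` is only meaningful for streaming requests.

```lua
-- Hypothetical body builder; returns a plain table, no JSON encoding.
local function build_body(model_cfg, messages, stream, opts)
  opts = opts or {}
  local body = {
    model    = model_cfg.model,
    messages = messages,
    stream   = stream and true or false,
  }
  if opts.tools then body.tools = opts.tools end
  if opts.max_tokens then body.max_tokens = opts.max_tokens end
  -- Default on: only an explicit `false` disables the flag.
  if stream and opts.include_usage ~= false then
    body.stream_options = { include_usage = true }
  end
  return body
end

local cfg  = { model = "example-model" }   -- placeholder model name
local msgs = { { role = "user", content = "hi" } }

local default_body = build_body(cfg, msgs, true)
local opted_out    = build_body(cfg, msgs, true, { include_usage = false, max_tokens = 64 })
```

Testing `~= false` rather than truthiness is what makes an absent `opts.include_usage` behave as the default-true setting.
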
### Category tagging
`opts.category` is a string set by the caller. broker echoes it into the
emitted usage payload so the accumulator knows what to credit. Default
category if absent: `"main"`.
---
## 5. Pillar 2 — Accumulator on ctx
### Shape
```lua
ctx.usage_totals = {
  -- [model_name] = { [category] = { prompt = N, completion = N,
  --                                 calls = N, cost = N, is_local = bool } }
  fast = {
    main = { prompt = 1234, completion = 567, calls = 14, cost = 0, is_local = true },
  },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180, is_local = false },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012, is_local = false },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042, is_local = false },
  },
}
-- R4: one flag per threshold (replaces the single cost_warn_fired bool).
ctx.cost_warn_state = { dollars = false, tokens = false }
```
### add_usage
```lua
function Context:add_usage(model, category, u)
  model = model or "?"
  category = category or "main"
  self.usage_totals = self.usage_totals or {}
  local m = self.usage_totals[model] or {}
  local c = m[category] or {
    prompt = 0, completion = 0, calls = 0, cost = 0,
    is_local = false,  -- R6: cloud unless any usage came w/o cost
  }
  c.prompt     = c.prompt + (u.prompt_tokens or 0)
  c.completion = c.completion + (u.completion_tokens or 0)
  c.calls      = c.calls + 1
  -- R6: preserve nil-vs-0 distinction. A `nil` cost means the
  -- provider doesn't emit cost (i.e., local llama.cpp). Sticky:
  -- once a slot has seen any nil-cost call, it's flagged is_local.
  if u.cost == nil then
    c.is_local = true
  else
    c.cost = c.cost + u.cost
  end
  m[category] = c
  self.usage_totals[model] = m
end

function Context:total_cost()
  local total = 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do total = total + c.cost end
  end
  return total
end

function Context:total_tokens()
  local p, comp = 0, 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do
      p = p + c.prompt
      comp = comp + c.completion
    end
  end
  return p, comp
end
```
### Reset semantics
`Context:reset()` deliberately does NOT clear `usage_totals` —
matches R8 invariant from Phase 6 (`:reset` clears `turns`,
`pending_exec_output`, `summary`; preserves `memory_items`, `project`,
and now `usage_totals`). The user reset their conversation, not their
cost meter. `:cost reset` is the explicit reset verb for the meter.
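
The two reset verbs can be contrasted in a few lines. This is a minimal sketch with a trimmed-down `Context`; the `cost_reset` method name is hypothetical (the document only names the `:cost reset` meta), and the accumulator keeps just the fields needed to show the R6 rule.

```lua
local Context = {}
Context.__index = Context

function Context.new()
  return setmetatable({ turns = {}, usage_totals = {} }, Context)
end

-- Trimmed add_usage: same nil-vs-0 rule as above, fewer fields.
function Context:add_usage(model, category, u)
  local m = self.usage_totals[model] or {}
  local c = m[category] or { calls = 0, cost = 0, is_local = false }
  c.calls = c.calls + 1
  if u.cost == nil then c.is_local = true else c.cost = c.cost + u.cost end
  m[category] = c
  self.usage_totals[model] = m
end

-- :reset clears the conversation only; the meter is untouched.
function Context:reset()
  self.turns = {}
end

-- Hypothetical method name for what `:cost reset` does to the meter.
function Context:cost_reset()
  self.usage_totals = {}
  self.cost_warn_state = { dollars = false, tokens = false }
end

local ctx = Context.new()
ctx.turns[1] = { role = "user", content = "hi" }
ctx:add_usage("fast", "main", { prompt_tokens = 10 })             -- no cost field
ctx:add_usage("cloud", "main", { prompt_tokens = 10, cost = 0 })  -- cloud, zero cost
ctx:reset()
local after_reset = ctx.usage_totals   -- still populated after :reset
ctx:cost_reset()
```

After `reset()` the turns are gone but both slots survive, with only the `fast` slot flagged `is_local`; after `cost_reset()` the meter is empty.
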
---
## 6. Pillar 3 — `:cost` meta
```
:cost           summary line
:cost detail    per-model + per-category breakdown
:cost reset     zero out ctx.usage_totals + both warn flags (cost_warn_state)
```
Summary format (R10 — 6-decimal precision for sub-cent costs):
```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
cost=$0.023400 (cloud only; local: 0)
```
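
Lua's `string.format` has no thousands-separator flag, so the comma grouping in the summary needs a small helper. The sketch below also shows why R10 widens the precision. `commas` and `summary` are hypothetical helper names, and the exact line layout is an assumption.

```lua
-- Hypothetical helper: thousands separators for a non-negative integer.
local function commas(n)
  local s = tostring(math.floor(n))
  -- Group digits in threes from the right by working on the reversed string.
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))   -- drop the comma left over on exact multiples of 3
end

-- Hypothetical summary formatter using the 6-decimal cost format from R10.
local function summary(calls, prompt, completion, cost)
  return ("[aish] session usage: %d calls, prompt=%s / completion=%s tokens\n"
       .. "       cost=$%.6f (cloud only; local: 0)")
    :format(calls, commas(prompt), commas(completion), cost)
end

-- R10: a sub-cent cost vanishes at 4 decimals but survives at 6.
local four = ("$%.4f"):format(0.000028)
local six  = ("$%.6f"):format(0.000028)

local line = summary(3, 1500, 420, 0.000028)
```
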
Detail format (R7 — sort key is `(cost desc, model asc, category asc)`
for deterministic ordering on equal-cost rows; R6 — annotation comes
from the slot's `is_local` flag, NOT a `cost == 0` heuristic):
```
[aish] session usage detail:
cloud main       8 calls, 3,850 / 980 tokens, $0.018000
cloud probe      1 call, 150 / 30 tokens, $0.004200
cloud delegate   1 call, 250 / 80 tokens, $0.001200
fast  main      14 calls, 8,200 / 2,100 tokens, $0 (local)
```
Implementation: pure Lua iteration over `ctx.usage_totals`; no broker
calls. Sort flattens into a list, sorts via `table.sort` with explicit
3-level comparator: `cost desc, model asc, category asc`.
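
A minimal sketch of the flatten-and-sort step, assuming the accumulator shape from §5. `detail_rows` is a hypothetical helper name; only `table.sort` with the three-level comparator is taken from the text.

```lua
-- Flatten a usage_totals-shaped table into rows, then sort deterministically.
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = {
        model = model, category = category,
        cost = c.cost, calls = c.calls, is_local = c.is_local,
      }
    end
  end
  table.sort(rows, function(a, b)
    if a.cost ~= b.cost then return a.cost > b.cost end        -- cost desc
    if a.model ~= b.model then return a.model < b.model end    -- model asc
    return a.category < b.category                             -- category asc
  end)
  return rows
end

local rows = detail_rows({
  fast = {
    main  = { calls = 14, cost = 0, is_local = true },
    probe = { calls = 2,  cost = 0, is_local = true },
  },
  cloud = {
    main     = { calls = 8, cost = 0.0180, is_local = false },
    delegate = { calls = 1, cost = 0.0012, is_local = false },
    probe    = { calls = 1, cost = 0.0042, is_local = false },
  },
})

local order = {}
for i, r in ipairs(rows) do order[i] = r.model .. "/" .. r.category end
```

Since each `(model, category)` pair is unique, the comparator never sees two equal rows, so the result does not depend on `pairs` iteration order or on `table.sort` being unstable. The two zero-cost `fast` rows fall back to the category tie-break.
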
---
## 7. Pillar 4 — Warning thresholds
Config:
```lua
cost = {
  warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
  warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}
```
R5 centralizes the check inside a single `_record_usage(model, cat, u)`
helper in repl.lua. This is the ONLY place that calls
`ctx:add_usage`; safety.lua call sites route through it via the
`helpers.on_usage` / `opts.on_usage` callback. Keeps `context.lua`
decoupled from `renderer` (no module-coupling violation).
R4: two independent flags (one per threshold) — first-to-fire must
NOT suppress the other.
```lua
-- repl.lua (sketch):
local function _record_usage(model, category, u)
  ctx:add_usage(model, category, u)
  if not (config.cost) then return end
  ctx.cost_warn_state = ctx.cost_warn_state or { dollars = false, tokens = false }
  local cw = ctx.cost_warn_state
  if config.cost.warn_at_dollars and not cw.dollars then
    local cost = ctx:total_cost()
    if cost >= config.cost.warn_at_dollars then
      -- R10: 6-decimal format for sub-cent visibility
      renderer.status(("session cost $%.6f has crossed warn_at_dollars=$%.6f")
        :format(cost, config.cost.warn_at_dollars))
      cw.dollars = true
    end
  end
  if config.cost.warn_at_tokens and not cw.tokens then
    local p, c = ctx:total_tokens()
    if (p + c) >= config.cost.warn_at_tokens then
      renderer.status(("session tokens %d has crossed warn_at_tokens=%d")
        :format(p + c, config.cost.warn_at_tokens))
      cw.tokens = true
    end
  end
end
```
One-shot per threshold per session. `:cost reset` clears both
totals AND both warn flags atomically.
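
The one-shot, per-threshold behaviour (R4) and emit-on-cross (A5) can be traced with stand-ins. In this sketch `renderer`, `config`, `state` and `record` are hypothetical stubs that collapse the ctx accumulator into two running totals; the thresholds are chosen for the example only.

```lua
-- Stand-ins for renderer.status, config and the ctx totals (all hypothetical here).
local statuses = {}
local renderer = { status = function(msg) statuses[#statuses + 1] = msg end }
local config = { cost = { warn_at_dollars = 0.50, warn_at_tokens = 1000 } }
local state = { cost = 0, tokens = 0, warn = { dollars = false, tokens = false } }

local function record(u)
  -- Accumulate first, then check: the crossing call itself triggers the warn.
  state.cost = state.cost + (u.cost or 0)
  state.tokens = state.tokens + (u.prompt_tokens or 0) + (u.completion_tokens or 0)
  local cw, cc = state.warn, config.cost
  if cc.warn_at_dollars and not cw.dollars and state.cost >= cc.warn_at_dollars then
    renderer.status(("session cost $%.6f has crossed warn_at_dollars=$%.6f")
      :format(state.cost, cc.warn_at_dollars))
    cw.dollars = true
  end
  if cc.warn_at_tokens and not cw.tokens and state.tokens >= cc.warn_at_tokens then
    renderer.status(("session tokens %d has crossed warn_at_tokens=%d")
      :format(state.tokens, cc.warn_at_tokens))
    cw.tokens = true
  end
end

record({ prompt_tokens = 900, completion_tokens = 200, cost = 0.10 })  -- tokens cross
record({ prompt_tokens = 100, completion_tokens = 50,  cost = 0.45 })  -- dollars cross
record({ prompt_tokens = 100, completion_tokens = 50,  cost = 0.45 })  -- both already fired
```

The token warning fires on the first call and the dollar warning on the second. With a single shared flag the dollar warning would have been suppressed, which is the failure R4 describes.
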
---
## 8. UX Surface Summary
| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear both warn flags |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |
R9 boundary note: `:resume <name>` reloads turns for conversation
continuity but does NOT reconstruct `ctx.usage_totals` from the
per-turn `usage` fields stored in the session JSONL. After `:resume`,
the cost meter starts fresh from zero for the resumed session's live
calls. The historical usage IS in the JSONL for after-the-fact
scripting; cross-session aggregation is Q-C2 deferred work.
---
## 9. Out of Scope (Phase 7)
- **Cross-session cost persistence** — Q-C2 defers `<history.dir>/cost.jsonl`
rollup; v1 is session-only. Per-turn usage IS in the session JSONL for
after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call** — v1 only
warns. A future phase could add a hard cap that aborts before the
broker call.
- **Pricing-table fallback for local models** — if a local model doesn't
emit `usage.cost`, we record no cost (the slot is flagged `is_local`, R6). Estimating cost from token count + a
static pricing table is a future polish (most users won't care about
local "cost" anyway — local is free).
- **Pretty token-bandwidth charts / sparklines** — out of scope; the
detail breakdown is text-only.
- **Estimated cost for future turns** — no preflight cost prediction.
- **MCP tool-call usage** — MCP servers don't expose token usage;
broker calls invoked DURING MCP tool dispatch ARE captured (because
they go through the same path), but the MCP tool call itself isn't.
---
## 10. Risks
| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> displayed `$0` could mislead user "is this REALLY free?" | `:cost detail` annotates local lines with `(local)` literal; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by `#models in config` × `#categories` — small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |
| R8: `call_broker` fallback retry passes `opts.include_usage` unchanged | Documented assumption: B1 confirmed both backends accept the flag. If a future fallback host rejects, the call-site that knows can pass `opts.include_usage = false` explicitly. |
---
## 11. Open Questions (Phase 7)
| # | Question | Resolution |
|---|---|---|
| Q-C1 | Provider-without-usage handling | A10 — defensive silent skip; baseline probe will confirm shape on local llama.cpp. |
| Q-C2 | Cross-session cost persistence (`cost.jsonl`) | Deferred to follow-up phase 8; v1 is session-only. |
| Q-C3 | Categories closed-set vs free-form | A4 — **free-form**; caller decides. Matches Phase 6 helpers/skills convention. |
| Q-C4 | `stream_options` forwarding by hossenfelder | B1 RESOLVED — both backends accept; flag is REQUIRED for local llama.cpp, no-op for cloud. Default-true is correct. |
| Q-C5 | Warn fires on the crossed call or the next | A5 — **on the crossed call** (no UX-defeating delay). |
| Q-C6 | `:reset` clears `cost_warn_fired` | A6 — **no**, only `:cost reset` clears the flag (R8 parity). |
---
## 12. Phase 7 → Phase 8+ Out-of-band
Candidate follow-ups (non-binding):
- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with
optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call** — an extension
of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4
heuristic on `Context:estimate_tokens()` with model `/tokenize` calls.
Indirectly improves accuracy of any future "preflight cost predictor".
Phase 7 itself is self-contained — no upstream dependencies.
---
## 13. Implementation Plan (commit-by-commit)
Bottom-up; broker first (it's the egress point that all callers
depend on), then context (the accumulator), then the call-site
rewires, then the user-facing meta + warn surface, then config +
status bump. Each commit leaves the tree green (existing tests +
load smoke + per-commit feature smoke).
### Order
1. **`broker.lua` — usage capture + signature widening.**
- `build_request(model_cfg, messages, stream, opts)` widened to
take an opts table; opts.tools / opts.max_tokens fold in from
the existing positional args.
- **R3: TWO internal callers of `build_request` exist inside
broker.lua itself** (`M.chat_stream` at lines 65-66 and indirectly
via `M.chat`). Both must be updated in this commit; the
migration is CONTAINED but not zero-touch. "Every caller already
passes opts" refers to the public surface — internal `build_request`
was positional.
- `opts.include_usage` (default true) adds
`stream_options.include_usage = true` to the request body (per B1,
required for local).
- `M.chat_stream` event loop adds `if doc.usage then final_usage =
doc.usage end`; after `curl.post_sse` returns, if `final_usage`
is set, `on_delta("usage", payload)` is called. Payload includes
`model = model_cfg.model` (caller-stable per B4 + R2), the raw
token counts, and `cost` as a number (nil for local per B3).
- opts.category passthrough — the broker just echoes it into the
emitted usage payload; doesn't validate (per A4 free-form).
- **R1: `M.chat` (non-streaming wrapper) MUST capture usage in its
internal on_delta and return `(text, usage)`. Without this, four
out of five non-streaming categories silently report zero.** §4
shows the explicit update.
- Smoke: hand-build a request with stream_options, capture all
three on_delta kinds (text, tool_call when applicable, usage),
confirm usage payload matches what curl shows. Also smoke
`broker.chat(...)` returns non-nil usage for cloud calls.
2. **`context.lua` — accumulator + helpers.**
- `Context.new`: `self.usage_totals = {}` + `self.cost_warn_fired = false`.
- `Context:add_usage(model, category, usage)` — increments
`usage_totals[model][category]` slots.
- `Context:total_cost()` — sums all cost fields across all models/categories.
- `Context:total_tokens()` — sums prompt + completion separately.
- `Context:reset` — does NOT touch `usage_totals` or `cost_warn_fired`
(R8 parity with `memory_items` and `project`).
- Smoke: 4-case inline test of add_usage / totals / reset preservation.
3. **`repl.lua` — wire opts.category + on_delta("usage") at non-Norris call sites.**
**N3: depends on commit 1's R1 M.chat fix shipping first.** This
commit's "capture the second return value" pattern only works
after M.chat actually returns one.
- `_record_usage(model, category, usage)` helper (R5) — the single
chokepoint that wraps `ctx:add_usage` AND does the warn check.
Replaces all direct `ctx:add_usage(...)` invocations in repl.lua.
- call_broker wrapper (used by ask_ai): pass `opts.category =
"main"`; the wrapped on_delta handles `kind == "usage"` by
calling `_record_usage(payload.model, payload.category, payload)`
— keys by **payload.model** per R2 (handles fallback retry
correctly without tracking primary-vs-fallback at the wrapper).
- DELEGATE: handler: opts.category = "delegate"; capture second
return value from broker.chat and feed to `_record_usage`.
- :delegate meta: opts.category = "delegate"; same.
- summarize-on-evict callback: opts.category = "summarize"; same.
- :memory summarize: opts.category = "memory_summarize"; same.
- Smoke: send one cloud prompt, observe ctx.usage_totals grows;
also smoke the fallback path with a deliberately-broken primary
and confirm usage credits the fallback model name (R2 verification).
4. **`safety.lua` — opts.category for Norris + probe.**
- safety.norris_step's broker.chat_stream call: pass `opts.category
= "norris"`. The on_delta wrapper inside safety.lua was already
widened (post-#52) to handle `kind == "text"` (rehydration);
it now also handles `kind == "usage"` by calling
`helpers.on_usage(payload.model, payload.category, payload)`.
R5: helpers.on_usage IS repl.lua's `_record_usage`.
- **N4 signature chain widening**: `llm_probe`, `llm_second_opinion`,
and `M.is_destructive` all widen to thread `opts.on_usage` through:
- `llm_probe(model_cfg, system, cmd, opts)` — pass `opts.category
= "probe"` to broker.chat; on the `(text, usage)` return,
if `opts.on_usage` AND `usage`, call `opts.on_usage(usage.model,
usage.category, usage)`.
- `llm_second_opinion(cmd, cfg, opts)` — pass opts through to
both llm_probe calls (probe 1 + probe 2 re-roll).
- `M.is_destructive(cmd, cfg, opts)` — opts.on_usage already in
the table from #52's scrub_msgs/rehydrate addition; threads
through naturally.
- Smoke: a Norris session shows both "norris" and "probe" category
entries in :cost detail; the probe model is named correctly
(e.g. "cloud" if safety.llm_model = "cloud").
5. **`repl.lua` — :cost meta + warn-threshold + HELP.**
- :cost (summary), :cost detail (per-model+category breakdown),
:cost reset (zero totals + clear cost_warn_fired).
- After every `ctx:add_usage` call (centralized in the `_record_usage`
helper from commit 3, per R5), check cfg.cost.warn_at_dollars /
warn_at_tokens; emit a one-shot status if crossed AND cost_warn_fired
is false.
- HELP gains 3 lines for :cost.
- Smoke: :cost shows totals; :cost detail breaks down; warn fires
once when threshold crossed; :cost reset re-arms.
6. **`config.lua` example block + `docs/PHASE7.md` status bump.**
- Commented-out `cost = { warn_at_dollars = 0.50, warn_at_tokens
= 100000 }` block in config.lua.
- **N5: PHASE0.md §11 amendment is already in tree** (committed
at `3bad07b` with the formulate doc). Commit 6 must NOT re-apply.
- PHASE7.md status header → **Implement** (matches Phase 5/6
cadence — manifest tracks implementation state).
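
The accumulator described in commit 2 can be sketched in isolation as follows. This is an illustrative outline, not the shipped context.lua: the `prompt_tokens` / `completion_tokens` field names assume an OpenAI-style usage object, and the `messages` field is a hypothetical stand-in for whatever `Context:reset` actually clears.

```lua
local Context = {}
Context.__index = Context

function Context.new()
  return setmetatable({
    messages = {},            -- hypothetical conversation state
    usage_totals = {},        -- model -> category -> counters
    cost_warn_fired = false,
  }, Context)
end

function Context:add_usage(model, category, usage)
  if not usage then return end                   -- no usage reported: skip silently
  self.usage_totals = self.usage_totals or {}    -- defensive for older ctx tables
  local by_model = self.usage_totals[model] or {}
  self.usage_totals[model] = by_model
  local slot = by_model[category]
    or { prompt_tokens = 0, completion_tokens = 0, cost = 0 }
  by_model[category] = slot
  slot.prompt_tokens = slot.prompt_tokens + (usage.prompt_tokens or 0)
  slot.completion_tokens = slot.completion_tokens + (usage.completion_tokens or 0)
  slot.cost = slot.cost + (usage.cost or 0)      -- cost is nil for local models
end

function Context:total_cost()
  local sum = 0
  for _, by_model in pairs(self.usage_totals) do
    for _, slot in pairs(by_model) do sum = sum + slot.cost end
  end
  return sum
end

-- Returns prompt and completion totals as two separate values.
function Context:total_tokens()
  local prompt, completion = 0, 0
  for _, by_model in pairs(self.usage_totals) do
    for _, slot in pairs(by_model) do
      prompt = prompt + slot.prompt_tokens
      completion = completion + slot.completion_tokens
    end
  end
  return prompt, completion
end

function Context:reset()
  self.messages = {}
  -- usage_totals and cost_warn_fired are deliberately left alone (R8 parity).
end

local ctx = Context.new()
ctx:add_usage("cloud", "main",  { prompt_tokens = 100, completion_tokens = 20, cost = 0.25 })
ctx:add_usage("cloud", "probe", { prompt_tokens = 10,  completion_tokens = 5,  cost = 0.5 })
ctx:add_usage("local", "main",  { prompt_tokens = 7,   completion_tokens = 3 })
ctx:reset()
assert(ctx:total_cost() == 0.75)   -- totals survive reset
```

The flat `model -> category -> counters` shape keeps `:cost detail` a plain nested loop over the same table.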
### Risk index per commit
| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | R3: build_request has TWO INTERNAL callers in broker.lua; both must be updated in this commit | Explicit in commit-1 note above; grep `build_request\(` to confirm |
| 1 (broker) | R1: M.chat must capture usage in on_delta and return (text, usage) | §4 shows the explicit M.chat update; smoke test verifies non-nil usage on cloud call |
| 1 (broker) | `M.chat` second return value confuses callers that do `local r = broker.chat(...)` discarding the second | Lua doesn't error on dropped return values; backward-compat preserved automatically |
| 2 (context) | usage_totals nil on old ctx serializations | Defensive `self.usage_totals = self.usage_totals or {}` in add_usage; no migration needed |
| 3 (repl wires) | Forgetting one call site = silent under-count | Lint by grep for `broker.chat\(` and `broker.chat_stream\(` after the wire commit; ensure each is tagged with opts.category |
| 3 (repl wires) | R2: fallback retry credits usage to wrong model | wrapped on_delta keys by `payload.model` (set inside broker per R2), NOT by outer `model_name`; smoke a deliberately-broken-primary case |
| 4 (safety wires) | safety.lua must NOT introduce new module dep | Use helpers.on_usage callback convention (matches #52's scrub_msgs) |
| 4 (safety wires) | N4: llm_probe → llm_second_opinion → is_destructive signature chain widening | Spelled out in commit-4 note above |
| 5 (:cost + warn) | warn fires repeatedly after a single call overshoots the threshold by a wide margin | per-threshold one-shot flag in `ctx.cost_warn_state`; explicit :cost reset to re-arm both |
| 5 (:cost + warn) | R4: single shared flag covers two thresholds | RESOLVED — split into `cost_warn_state.dollars` + `.tokens` |
| 6 (config + status) | N5: PHASE0 §11 already amended at `3bad07b` | This commit does NOT re-apply the amendment |
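
The per-threshold one-shot behaviour from the commit-5 rows could look roughly like this. It is a sketch under assumptions: `ctx` is reduced to flat running totals, `status` and `cost_reset` are hypothetical stand-ins for the REPL's status output and the `:cost reset` handler, and both the `>=` comparison and the combined prompt+completion token total are guesses rather than confirmed behaviour.

```lua
local cfg = { cost = { warn_at_dollars = 0.50, warn_at_tokens = 100000 } }

-- Stand-in for the real Context: flat running totals only.
local ctx = { cost = 0, tokens = 0,
              cost_warn_state = { dollars = false, tokens = false } }
function ctx:add_usage(_model, _category, usage)
  self.cost = self.cost + (usage.cost or 0)
  self.tokens = self.tokens
    + (usage.prompt_tokens or 0) + (usage.completion_tokens or 0)
end

local warnings = {}
local function status(msg) warnings[#warnings + 1] = msg end

-- Single chokepoint: accumulate first, then check each threshold once.
local function _record_usage(model, category, usage)
  if not usage then return end
  ctx:add_usage(model, category, usage)
  local limits, fired = cfg.cost or {}, ctx.cost_warn_state
  if limits.warn_at_dollars and not fired.dollars
     and ctx.cost >= limits.warn_at_dollars then
    fired.dollars = true
    status(string.format("[aish] cost warning: $%.2f this session", ctx.cost))
  end
  if limits.warn_at_tokens and not fired.tokens
     and ctx.tokens >= limits.warn_at_tokens then
    fired.tokens = true
    status(string.format("[aish] token warning: %d tokens this session", ctx.tokens))
  end
end

-- Hypothetical `:cost reset` handler: zero totals and re-arm both flags.
local function cost_reset()
  ctx.cost, ctx.tokens = 0, 0
  ctx.cost_warn_state = { dollars = false, tokens = false }
end

-- One call far past the dollar threshold, then another: still one warning.
_record_usage("cloud", "main", { prompt_tokens = 900, completion_tokens = 100, cost = 2.00 })
_record_usage("cloud", "main", { prompt_tokens = 900, completion_tokens = 100, cost = 1.00 })
assert(#warnings == 1)
```

Because the two flags are independent, crossing the token threshold later still produces its own warning even after the dollar warning has fired.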
### Tests + smoke per commit
Each commit:
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
- Pass a per-feature smoke (described under each commit in the Order list above)
### Things deliberately NOT split
- broker.chat backward-compat shim — Lua's multiple-return-values
semantics handle it automatically (existing `local r = broker.chat(...)`
drops the new `usage` value).
- Per-category sub-tables — flat `model -> category -> counters` is
simple enough; nesting deeper for e.g. timestamps is v2.
- Cross-session persistence — explicitly Q-C2 deferred to phase 8.
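
The backward-compat claim for `broker.chat` rests on how Lua adjusts multiple return values, which can be checked with a small stand-alone sketch. The `chat_stream` stub and both signatures below are assumptions for illustration; only the `on_delta("usage", payload)` convention and the `(text, usage)` return come from this plan.

```lua
-- Stub standing in for M.chat_stream: replays canned events through on_delta.
local function chat_stream(model_cfg, _messages, on_delta, opts)
  on_delta("text", "Hello, ")
  on_delta("text", "world")
  on_delta("usage", { model = model_cfg.model, category = opts and opts.category,
                      prompt_tokens = 12, completion_tokens = 2, cost = 0.001 })
end

-- Non-streaming wrapper: collects text and captures the usage event (R1).
local function chat(model_cfg, messages, opts)
  local parts, usage = {}, nil
  chat_stream(model_cfg, messages, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end, opts)
  return table.concat(parts), usage
end

local cfg, opts = { model = "cloud" }, { category = "delegate" }

-- Old-style caller: the extra return value is dropped.
local r = chat(cfg, {}, opts)
assert(r == "Hello, world")

-- New-style caller: captures both values.
local text, usage = chat(cfg, {}, opts)
assert(text == "Hello, world" and usage.category == "delegate")

-- Caveat: a call in the LAST position of a table constructor or argument
-- list expands to all of its results, so usage is no longer dropped here.
local packed = { chat(cfg, {}, opts) }
assert(#packed == 2)
assert(select("#", (chat(cfg, {}, opts))) == 1)  -- parentheses truncate to one value
```

Plain assignments are safe, but any existing site that forwards `broker.chat(...)` directly as the final argument of another call, or inside `return`, would start passing `usage` along; the same grep used for the call-site lint would likely surface those.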
### Open at plan-time (resolve at implement)
- Whether `safety.is_destructive`'s opts should carry `on_usage`
callback explicitly OR thread through cfg.helpers (the latter
matches the Norris helpers convention but is more coupling).
Decide at commit 4. Default to explicit opts.on_usage for minimum
surface.
- Whether to emit a `[aish] usage: model=X prompt=N completion=M cost=$X`
status line PER TURN (verbose mode) or only via :cost on demand.
v1 = on demand only; verbose mode is a follow-up nice-to-have.
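
For the first open item, the default choice (an explicit `opts.on_usage` callback) might be threaded through `llm_probe` along these lines. The `broker.chat(model_cfg, messages, opts)` signature and the stubbed reply are assumptions for illustration; the category tag and the callback argument order follow the commit-4 notes.

```lua
-- Stub broker: returns canned text plus a usage table, as M.chat would after R1.
local broker = {}
function broker.chat(model_cfg, _messages, opts)
  return "NO", { model = model_cfg.model, category = opts.category,
                 prompt_tokens = 40, completion_tokens = 1 }
end

-- Probe helper: tags the call as "probe" and reports usage only when the
-- caller supplied a callback AND the provider actually returned usage.
local function llm_probe(model_cfg, system, cmd, opts)
  opts = opts or {}
  local text, usage = broker.chat(model_cfg, {
    { role = "system", content = system },
    { role = "user",   content = cmd },
  }, { category = "probe" })
  if opts.on_usage and usage then
    opts.on_usage(usage.model, usage.category, usage)
  end
  return text
end

local seen = {}
local verdict = llm_probe({ model = "cloud" }, "Destructive? YES/NO", "ls -la", {
  on_usage = function(model, category, _usage)
    seen[#seen + 1] = model .. "/" .. category
  end,
})
assert(verdict == "NO" and seen[1] == "cloud/probe")
```

Passing the callback in `opts` keeps safety.lua free of any dependency on repl.lua: the REPL hands in `_record_usage`, and callers that omit it simply get no accounting.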