aish/docs/PHASE7.md

# aish — Phase 7 Manifest

**Project:** aish — AI-augmented conversational shell
**Document:** Phase 7 Requirements, Architecture & Design Decisions
**Status:** Implement (6 commits landed: 7364963, 7b4a9be, 8adebd5, b30212a, 0d6ff93, this)
**Date:** 2026-05-16

**Review findings (independent Sonnet agent, 2026-05-16) — 3 BLOCKERs
resolved in-place, 7 CONCERNs folded, 5 NITs applied:**

R1 (BLOCKER, RESOLVED). **`M.chat` would silently return `(text, nil)`
    for ALL non-streaming callers.** `M.chat`'s internal on_delta only
    captures `kind == "text"`. Without explicit handling of
    `kind == "usage"`, the four categories that go through
    `broker.chat` (summarize / delegate / memory_summarize / probe;
    four of the six listed in §2) would report zero usage even after
    a cloud round-trip. **Fix
    folded into §4 + §13 commit 1:** M.chat's on_delta also captures
    the usage payload and returns it as the second value.

R2 (BLOCKER, RESOLVED). **`call_broker` fallback retry — usage
    payload's `model` field credits the WRONG model name.** The
    `wrapped` on_delta in call_broker is closed over the PRIMARY's
    name; if the wrapped function uses an outer-scope `model_name`
    variable to key the accumulator, the fallback's usage gets
    misattributed. **Resolution:** the broker emits `payload.model =
    model_cfg.model` (which IS the fallback's model when called with
    `fb_cfg` — chat_stream's local upvar). The wrapper keys by
    `payload.model`, NOT by the outer `model_name`. Documented in
    §4 emission code + §13 commit 3 (wrapped on_delta uses
    `payload.model` for accumulator keying).

R3 (BLOCKER, RESOLVED — promoted to docs). **`build_request` has
    TWO internal callers inside broker.lua itself**, not just the
    public surface. Migration is contained but both internal sites
    must be updated in commit 1. Plan §13 commit 1 risk row updated
    to call this out explicitly so the implementer doesn't read
    "every caller already passes opts" as "only external callers
    need touching".

R4 (CONCERN, FOLDED). **Single `cost_warn_fired` flag for two
    thresholds is broken.** When both warn_at_dollars AND
    warn_at_tokens are configured, the first-to-fire suppresses the
    other. **Fix:** `ctx.cost_warn_fired` becomes `ctx.cost_warn_state
    = { dollars = false, tokens = false }`. Each threshold has its
    own flag; `:cost reset` clears both. §7 pseudocode updated.

R5 (CONCERN, FOLDED). **Warn-check centralization decided:** use a
    single `_record_usage(model, category, usage)` helper inside
    repl.lua that wraps `ctx:add_usage` AND does the threshold check
    AND calls renderer.status when crossed. `context.lua` stays
    decoupled from `renderer`. safety.lua call sites get
    `helpers.on_usage = _record_usage` in the helpers table; probe
    callsite gets `opts.on_usage = _record_usage`. Single chokepoint
    for the warn check. §3 + §7 + §13 commits 3-5 reflect.

R6 (CONCERN, FOLDED). **`nil` vs `0` cost distinction must be
    preserved at the accumulator level.** Local-model `$0` (no cost
    field) vs cloud-call-that-happens-to-cost-zero need to be
    distinguishable for `:cost detail` annotation. **Fix:** accumulator
    slot gains `is_local = true` when ANY recorded usage for that
    slot had `cost == nil`. Cloud calls with `cost = 0` (rare) stay
    annotated as cloud. §5 pseudocode + §6 annotation logic updated.

R7 (CONCERN, FOLDED). **`:cost detail` sort needs three-level key
    for determinism.** Lua's `table.sort` is unstable; equal-cost
    rows would have arbitrary order. **Fix:** sort key is
    `(cost desc, model asc, category asc)`. §6 updated.

R8 (CONCERN, FOLDED). **`call_broker` fallback passes `opts.include_usage`
    unchanged.** Documented as a known assumption (B1 confirms both
    backends accept; if a future fallback host rejects, the call-site
    can pass `include_usage = false` explicitly). §10 risk row added.

R9 (CONCERN, FOLDED). **`:resume` does NOT restore historical
    `usage_totals`.** Per-turn usage IS in the session JSONL but
    `:resume` reloads turns for conversation continuity only; the
    accumulator stays empty. Documented in §8 surface notes; users
    who want cross-session totals can script the jsonl or wait for
    the deferred Q-C2 follow-up.

R10 (CONCERN, FOLDED). **`$%.4f` loses sub-cent precision.** A
     `0.000028` cloud cost displays as `$0.0000` — indistinguishable
     from `$0` local. **Fix:** format strings widened to `$%.6f` in
     §6 (and the warn message in §7). 6 decimal places accommodates
     the smallest observed real cost.

R-N1..N5 (NITs, APPLIED):

  N1. §4 extraction pseudocode gains a comment noting the
      `if doc.usage` branch is INDEPENDENT of the choice branch and
      must be checked regardless of choice nil-ness (handles both
      B2 emission shapes).
  N2. §2 "Cost extraction" row referenced stale "B7"; corrected to B3.
  N3. §13 commit 3 row gains an explicit dependency note: commit 3's
      "capture the new second return value" requires commit 1's M.chat
      fix from R1 to ship first.
  N4. §3 safety.lua row + §13 commit 4 row spell out the signature
      chain: `llm_probe` → `llm_second_opinion` → `M.is_destructive`
      all widen to thread `opts.on_usage` through.
  N5. §3 PHASE0.md row + §13 commit 6 row — the PHASE0 §11 amendment
      is ALREADY in tree (committed at `3bad07b` with the formulate
      doc). Commit 6 should NOT re-apply; only adds config.lua block
      + bumps PHASE7 status header.

**Analyze findings (2026-05-16):**

A1. **broker.chat_stream surface is clean for the extension.** The
    existing `on_event(data)` closure inside `M.chat_stream` already
    parses `doc.error` / `doc.choices` / `delta` / tool_calls — adding
    `if doc.usage then final_usage = ... end` is one block. Emission
    happens via a closure-local `final_usage` that the post-loop code
    in `chat_stream` reads and calls `on_delta("usage", final_usage)`
    on. `build_request` needs a minor extension. The alternative of having
    `chat_stream` insert `stream_options.include_usage = true` into the
    body table itself does not work today, because the body is already
    passed through `json.encode` inside `build_request`.
    Cleanest: extend `build_request(model_cfg, messages, stream, opts)`
    so it can read `opts.include_usage`. Phase 7 simplifies the
    signature in passing.

A2. **7 caller sites** identified for `opts.category` threading:

    | Site | Category |
    |---|---|
    | `safety.lua:191` (LLM probe) | `"probe"` |
    | `safety.lua:354` (norris main) | `"norris"` |
    | `repl.lua:326` (summarize-on-evict) | `"summarize"` |
    | `repl.lua:685` (call_broker wrapper, used by ask_ai) | `"main"` |
    | `repl.lua:1104` (DELEGATE: handler) | `"delegate"` |
    | `repl.lua:1587` (:memory summarize) | `"memory_summarize"` |
    | `repl.lua:2156` (:delegate meta) | `"delegate"` |

    All callers pass `opts` already; adding a `category` field is
    additive and backward-compatible (default to `"main"` when absent).

A3. **`build_request` signature simplification.** Today it takes
    `(model_cfg, messages, stream, tools, max_tokens)` — five positional
    args. With Phase 7 needing `include_usage` AND `stream_options`,
    positional growth gets unwieldy. **Resolution:** widen to
    `(model_cfg, messages, stream, opts)` where opts carries
    `{tools, max_tokens, include_usage, stream_options}`. Callers in
    `M.chat_stream` and `M.chat` pass their existing opts table through.
    This is a refactor but contained inside broker.lua.

A4. **Q-C3 RESOLVED: free-form categories.** The closed-set vs free-form
    debate resolved in favor of free-form per the helpers/skills
    convention already in place (Phase 6 :tree / :diff metas don't
    validate sub-args either). `:cost detail` will show whatever
    categories appear; in practice that is a small, documented set
    (6 distinct categories across the 7 call sites in A2), so no surprises.

A5. **Q-C5 RESOLVED: warn fires on the call that crossed.** The crossed
    call's usage IS in the accumulator at the moment we check (we
    check AFTER `add_usage`). Firing on the NEXT call would mean a
    delay of one full broker round-trip before the user sees the
    warn — defeats the purpose. Just emit-on-cross.

A6. **Q-C6 RESOLVED: `:reset` does NOT clear `cost_warn_fired`**
    (split into per-threshold flags in `cost_warn_state` by R4).
    Parity with `usage_totals` itself (per the §2 decision row); the
    user reset their conversation, not their cost meter. The flags
    AND the totals are reset only by the explicit `:cost reset` verb.

A7. **Norris call-graph rewires (existing safety.lua:354 path):** with
    issue #52 wired (commit `955bd82`), the Norris broker call now
    passes `helpers.scrub_msgs` / `helpers.streaming_rehydrator`. The
    on_delta wrapping pattern means I need to be careful that the new
    `("usage", payload)` kind also flows through any wrapper. Since
    secrets streaming_rehydrator only matches on `kind == "text"`, the
    "usage" kind passes through unchanged. No new entanglement.

A8. **`ctx.usage_totals` survives `:reset` per the Phase 6 R8 invariant**
    (not this document's R8). It is the same rule that protects
    `memory_items` (Phase 4) and `project` (Phase 6). Documented in
    §5 of the manifest; reinforces the "ambient context survives
    conversation reset" rule.

A9. **Session JSONL serialization** — assistant turn dict gets an
    optional `usage` field. `history.lua` log_turn currently calls
    `json.encode(turn)` opaquely; the dkjson serializer handles nested
    tables. No code change needed; the new field flows through
    automatically when the assistant turn carries one.

A10. **Q-C1 PARTIAL: local providers may not emit `usage`.** The
     formulate-time assumption was "treat absence as zero-cost / unknown".
     A real probe against `qwen-coder-7b-snappy-8k` is a baseline
     action — see B-probes below. The implementation will be defensive:
     if `doc.usage` never appears in the stream, no "usage" event is
     emitted, and the accumulator is unchanged for that turn. `:cost`
     output naturally reflects "0 calls counted for local model" if
     that's the case.

A11. **Q-C4 deferred to baseline**: actual `stream_options` forwarding
     by the hossenfelder proxy must be probed against a live broker.
     If the proxy strips the option, we get no `usage` events even
     for cloud calls. Baseline action.

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest
specifies what Phase 7 adds — **cost / usage observability**: the ability
to know, mid-session, how many tokens you've spent and how much money the
paid-cloud calls have cost.

PHASE0 §11 originally listed phases only through 6; the formulate commit
(`3bad07b`) amended §11 to add Phase 7.

---

## 1. Scope of Phase 7

Four pillars:

1. **Usage capture in broker** — `broker.chat_stream` extracts the
   provider's `usage` block (and `cost` where present) from the response
   stream. Surfaces it to the caller via a new `on_delta("usage", ...)`
   kind. The existing `broker.chat` buffering wrapper exposes it as a
   second return value `(text, usage)`. Backward-compatible: callers
   that don't handle the new kind / second value simply ignore it.

2. **Per-session accumulator on `ctx`** — running totals per-model AND
   per-call-category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on
   `ctx.usage_totals`. No persistence across sessions in v1 (Q-C2
   defers cross-session); the session-log JSONL files DO carry per-turn
   usage so historical analysis is possible after the fact.

3. **`:cost` meta** — a `:cost` reporter that shows the current session
   totals, with optional `:cost detail` for the per-model + per-category
   breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).

4. **Optional warning thresholds** — `cfg.cost.warn_at_dollars` and
   `cfg.cost.warn_at_tokens` emit a status the first time the running
   total crosses the configured threshold. Default off (no warnings
   without config). Useful when cloud presets are configured and you
   want a "you've spent $1 this session" nudge before runaway cost.

**Phase 7 is done when:**

- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)`
  callback kind; `broker.chat` returns `(text, usage)`. Backward compat
  preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a
  total like:
  ```
  [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                         cost=$0.023400 (cloud only; local: 0)
  ```
- `:cost detail` breaks down by model + category:
  ```
  cloud     main      8 calls,  3,850 / 980 tokens,   $0.018000
  cloud     probe     1 call,     150 / 30  tokens,   $0.004200
  cloud     delegate  1 call,     250 / 80  tokens,   $0.001200
  fast      main     14 calls,  8,200 / 2,100 tokens, $0       (local)
  ```
- Session JSONL gains a `usage` field on assistant turns (when the
  broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative
  emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6
  (Phase 6 regression coverage).

---

## 2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, reading the `usage` field that arrives on the final SSE chunk | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified at baseline. |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; baseline probe confirms current broker tolerates it. (A3: `build_request` signature widens to take an `opts` table; positional growth was getting unwieldy.) |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One-tag-per-call-site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field — record as nil (R6 — preserves the local-vs-cloud-zero distinction in the accumulator). | Single field name across observed providers per baseline B3. |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits — plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of each threshold emits a single status, e.g. `[aish] session cost $X.XXXXXX has crossed warn_at_dollars=$Y.YYYYYY` (6 decimals per R10). Per-threshold crossed flags (R4) are stored on ctx; reset only on session end / `:cost reset`. | One-shot per threshold to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`) — the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |

---

## 3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_state` (`{ dollars = false, tokens = false }`, one flag per threshold per R4). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Route `on_delta("usage", ...)` through a single `_record_usage(model, category, usage)` helper (R5) that calls `ctx:add_usage(...)` and then runs the cost-warn check. New `:cost`, `:cost detail`, and `:cost reset` metas. |
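| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Per N4, `llm_probe` → `llm_second_opinion` → `M.is_destructive` all widen to thread `opts.on_usage` through; Norris call sites receive `helpers.on_usage` (R5). Surfacing probe cost in the breakdown is useful since `safety.llm_model = "cloud"` is the recommended setting. |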
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes turn opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless filter desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | Amendment landed at `3bad07b` (formulate commit). N5: commit 6 does NOT re-apply. |

No new module files.

---

## 4. Pillar 1 — Usage capture in broker

### SSE shape (provider-by-provider — confirm in baseline)

For OpenAI-compatible streams with `stream_options: { include_usage: true }`:

```json
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```

The final usage event arrives AFTER `finish_reason` but BEFORE `[DONE]`.
`choices` is empty `[]` on the usage event.

For a true non-streaming request, usage sits at the top level of the
response body. `broker.chat` never sends one: it is a buffering wrapper
around chat_stream, so it inherits the on_delta path.

For local llama.cpp via hossenfelder: usage may or may not be present
depending on the proxy's version. Treat absence as zero-cost / unknown.

### Extraction algorithm

```lua
local final_usage = nil

local function on_event(data)
    ...
    -- N1: this branch is INDEPENDENT of the choice branch below;
    -- check unconditionally. Per B2, local emits usage on a
    -- choices=[] chunk (choice nil); cloud emits on a non-empty
    -- choices chunk (with finish_reason). Both shapes funnel here.
    if doc.usage then
        -- R2: payload.model is ALWAYS the caller-stable model_cfg.model
        -- (chat_stream's local upvar). When called via call_broker's
        -- fallback retry, this naturally reflects the fallback's
        -- model name — wrapper callers can key by payload.model
        -- without tracking primary-vs-fallback themselves.
        final_usage = {
            prompt_tokens     = doc.usage.prompt_tokens or 0,
            completion_tokens = doc.usage.completion_tokens or 0,
            total_tokens      = doc.usage.total_tokens or 0,
            -- R6: keep nil-vs-0 distinction at this layer; the
            -- accumulator decides how to tag local-vs-cloud-zero.
            cost              = doc.usage.cost,   -- nil for local
            model             = model_cfg.model,  -- caller-stable per B4
            category          = opts.category or "main",
        }
        -- Don't emit yet — the [DONE] event marks stream end; emit
        -- once we exit the curl.post_sse loop so the caller sees
        -- usage as the LAST event in the stream order.
    end
    -- ... existing text + tool_call handling (unchanged) ...
end

-- After curl.post_sse returns (stream complete). R3-related:
-- only emit on successful streams; transport / api errors skip
-- the usage event (caller sees the error path and accumulator
-- stays unchanged).
if api_err then return nil, "api: " .. api_err end
if not ok    then return nil, "transport: " .. tostring(err) end
if final_usage then on_delta("usage", final_usage) end
return true
```
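
The ordering rule above (usage is captured whenever it appears, but emitted only after the stream drains) can be exercised without a live broker. The following is a minimal sketch under stated assumptions: `simulate_stream` is a hypothetical stand-in for `chat_stream` that consumes already-decoded SSE documents (plain Lua tables) instead of calling `curl.post_sse`, and it ignores tool calls and error paths.

```lua
-- Hypothetical stand-in for chat_stream: `docs` is a list of already-decoded
-- SSE documents; the real broker would receive these via curl.post_sse.
local function simulate_stream(model_cfg, docs, on_delta, opts)
    opts = opts or {}
    local final_usage = nil
    for _, doc in ipairs(docs) do
        -- Usage branch is checked unconditionally (N1), whatever `choices` holds.
        if doc.usage then
            final_usage = {
                prompt_tokens     = doc.usage.prompt_tokens or 0,
                completion_tokens = doc.usage.completion_tokens or 0,
                total_tokens      = doc.usage.total_tokens or 0,
                cost              = doc.usage.cost,     -- stays nil for local
                model             = model_cfg.model,
                category          = opts.category or "main",
            }
        end
        local choice = doc.choices and doc.choices[1]
        local text = choice and choice.delta and choice.delta.content
        if text then on_delta("text", text) end
    end
    -- Emit usage only after the loop, so it is the last event the caller sees.
    if final_usage then on_delta("usage", final_usage) end
    return true
end

-- Example feed in the local shape: usage arrives on a choices=[] chunk.
local events = {}
simulate_stream({ model = "fast" }, {
    { choices = { { delta = { content = "Hi" } } } },
    { choices = { { delta = {}, finish_reason = "stop" } } },
    { choices = {}, usage = { prompt_tokens = 15, completion_tokens = 3,
                              total_tokens = 18 } },
}, function(kind, payload)
    events[#events + 1] = { kind = kind, payload = payload }
end, { category = "probe" })
```

With this feed the caller sees one `"text"` event followed by one `"usage"` event whose `cost` is nil, which is the shape the accumulator's R6 handling expects for local models.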

### `M.chat` capture (R1 — BLOCKER fix)

`M.chat` is the non-streaming buffering wrapper. Its existing on_delta
only captured text. Under Phase 7 it MUST also capture the usage
payload; otherwise EVERY non-streaming caller (summarize, delegate,
memory_summarize, probe: 4 of the 6 categories) silently reports zero.

```lua
function M.chat(model_cfg, messages, opts)
    local parts        = {}
    local captured_usage  -- R1: required so M.chat returns (text, usage)
    local ok, err = M.chat_stream(model_cfg, messages,
        function(kind, payload)
            if kind == "text"  then parts[#parts + 1] = payload
            elseif kind == "usage" then captured_usage = payload
            end
        end, opts)
    if not ok then return nil, err end
    return table.concat(parts), captured_usage
end
```

Existing callers that do `local r = broker.chat(...)` automatically
drop the second value (Lua semantics). Callers that want usage do
`local r, u = broker.chat(...)`.
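
The Lua semantics relied on here have one edge worth keeping in mind when auditing call sites. The sketch below uses a hypothetical `chat_stub` in place of `broker.chat` to show where the second value is dropped and where it is not.

```lua
-- Hypothetical stub with the Phase 7 return shape; stands in for broker.chat.
local function chat_stub()
    return "hello", { prompt_tokens = 15, completion_tokens = 3 }
end

-- Pre-Phase-7 style caller: the extra return value is silently discarded.
local reply = chat_stub()

-- Phase 7 style caller: captures both values.
local text, usage = chat_stub()

-- Caveat: a call in the LAST position of an argument list or table
-- constructor expands to all of its results; parentheses truncate to one.
local expanded  = { chat_stub() }     -- holds both values
local truncated = { (chat_stub()) }   -- holds only the text
```

Plain assignment is safe, as the text above says. A call site that passes `broker.chat(...)` directly as the final argument of another function would start forwarding the usage table as an extra argument; wrapping the call in parentheses keeps the old single-value behavior.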

### Outbound include_usage

```lua
local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
    body_table.stream_options = { include_usage = true }
end
```

Risk: some providers reject unrecognized fields. Baseline check; if any
host rejects `stream_options`, the affected call site opts out by passing
`opts.include_usage = false`.
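
A sketch of how the widened builder from A3 might assemble the body, assuming it returns a Lua table and leaves JSON encoding to its caller. The name `build_request_body` and the guard on `stream` are illustrative assumptions, not the broker's actual code.

```lua
-- Hypothetical sketch of the opts-table builder; returns the body as a Lua
-- table rather than an encoded string.
local function build_request_body(model_cfg, messages, stream, opts)
    opts = opts or {}
    local body = {
        model    = model_cfg.model,
        messages = messages,
        stream   = stream and true or false,
    }
    if opts.tools      then body.tools      = opts.tools      end
    if opts.max_tokens then body.max_tokens = opts.max_tokens end
    -- Default on; only an explicit `false` opts out. The stream guard is an
    -- assumption: stream_options is only meaningful on streaming requests.
    if body.stream and opts.include_usage ~= false then
        body.stream_options = { include_usage = true }
    end
    return body
end

local with_usage    = build_request_body({ model = "fast" }, {}, true, nil)
local without_usage = build_request_body({ model = "fast" }, {}, true,
                                         { include_usage = false })
```

Testing `~= false` rather than truthiness is what makes the default-on behavior work when `opts` omits the field entirely.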

### Category tagging

`opts.category` is a string set by the caller. broker echoes it into the
emitted usage payload so the accumulator knows what to credit. Default
category if absent: `"main"`.
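
Because the broker stamps both `model` and `category` into the payload, a call-site wrapper can credit usage without tracking primary-versus-fallback itself (R2). A minimal sketch, assuming a hypothetical `wrap_on_delta` helper and a recorder with the `(model, category, usage)` signature described in R5:

```lua
-- Hypothetical call-site wrapper: forwards every event to the caller's
-- on_delta and records usage keyed by the payload, not by an outer name.
local function wrap_on_delta(on_delta, record_usage)
    return function(kind, payload)
        if kind == "usage" then
            record_usage(payload.model, payload.category, payload)
        end
        return on_delta(kind, payload)
    end
end

local credited = {}
local wrapped = wrap_on_delta(function() end, function(model, category)
    credited[#credited + 1] = model .. "/" .. category
end)

-- Simulate a fallback retry: the broker stamped the fallback's model name,
-- even though the wrapper was created while the primary was in play.
wrapped("text", "partial")
wrapped("usage", { model = "cloud", category = "main", prompt_tokens = 10 })
```

Whether the `"usage"` event should also be forwarded to the inner `on_delta`, as done here, is a design choice this sketch does not settle.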

---

## 5. Pillar 2 — Accumulator on ctx

### Shape

```lua
ctx.usage_totals = {
    -- [model_name] = { [category] = { prompt = N, completion = N,
    --                                 calls = N, cost = N,
    --                                 is_local = bool } }  -- is_local per R6
    fast = {
        main      = { prompt = 8200, completion = 2100, calls = 14, cost = 0,      is_local = true  },
    },
    cloud = {
        main      = { prompt = 3850, completion = 980,  calls = 8,  cost = 0.0180, is_local = false },
        delegate  = { prompt = 250,  completion = 80,   calls = 1,  cost = 0.0012, is_local = false },
        probe     = { prompt = 150,  completion = 30,   calls = 1,  cost = 0.0042, is_local = false },
    },
}
-- R4: one flag per threshold, so the first to fire cannot suppress the other.
ctx.cost_warn_state = { dollars = false, tokens = false }
```

### add_usage

```lua
function Context:add_usage(model, category, u)
    model    = model    or "?"
    category = category or "main"
    self.usage_totals = self.usage_totals or {}
    local m = self.usage_totals[model] or {}
    local c = m[category] or {
        prompt = 0, completion = 0, calls = 0, cost = 0,
        is_local = false,  -- R6: cloud unless any usage came w/o cost
    }
    c.prompt     = c.prompt     + (u.prompt_tokens or 0)
    c.completion = c.completion + (u.completion_tokens or 0)
    c.calls      = c.calls      + 1
    -- R6: preserve nil-vs-0 distinction. A `nil` cost means the
    -- provider doesn't emit cost (i.e., local llama.cpp). Sticky:
    -- once a slot has seen any nil-cost call, it's flagged is_local.
    if u.cost == nil then
        c.is_local = true
    else
        c.cost = c.cost + u.cost
    end
    m[category] = c
    self.usage_totals[model] = m
end

function Context:total_cost()
    local total = 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do total = total + c.cost end
    end
    return total
end

function Context:total_tokens()
    local p, comp = 0, 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do
            p    = p    + c.prompt
            comp = comp + c.completion
        end
    end
    return p, comp
end
```

### Reset semantics

`Context:reset()` deliberately does NOT clear `usage_totals` —
matches R8 invariant from Phase 6 (`:reset` clears `turns`,
`pending_exec_output`, `summary`; preserves `memory_items`, `project`,
and now `usage_totals`). The user reset their conversation, not their
cost meter. `:cost reset` is the explicit reset verb for the meter.
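
The two reset verbs can be contrasted in a compact sketch. The `Context` below is a stripped-down stand-in for `context.lua`, the method name `cost_reset` is a hypothetical choice for whatever `:cost reset` ends up calling, and the warn flags use the per-threshold `cost_warn_state` shape from R4.

```lua
-- Compact stand-in for context.lua; only the fields relevant to resets.
local Context = {}
Context.__index = Context

function Context.new()
    return setmetatable({
        turns           = {},
        usage_totals    = {},
        cost_warn_state = { dollars = false, tokens = false },
    }, Context)
end

-- Conversation reset: drops turns, leaves the meter untouched.
function Context:reset()
    self.turns = {}
end

-- Meter reset (hypothetical name): clears totals and both warn flags.
function Context:cost_reset()
    self.usage_totals    = {}
    self.cost_warn_state = { dollars = false, tokens = false }
end

local ctx = Context.new()
ctx.turns[1] = { role = "user", content = "hi" }
ctx.usage_totals.cloud = {
    main = { prompt = 10, completion = 2, calls = 1, cost = 0.0001 },
}
ctx.cost_warn_state.dollars = true

ctx:reset()
local calls_after_reset = ctx.usage_totals.cloud.main.calls
local warn_after_reset  = ctx.cost_warn_state.dollars
ctx:cost_reset()
```

After `ctx:reset()` the turn list is empty but the cloud slot and the fired dollar flag are still there; only `ctx:cost_reset()` clears them.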

---

## 6. Pillar 3 — `:cost` meta

```
:cost                       summary line
:cost detail                per-model + per-category breakdown
:cost reset                 zero out ctx.usage_totals + both warn flags (cost_warn_state)
```

Summary format (R10 — 6-decimal precision for sub-cent costs):

```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                       cost=$0.023400 (cloud only; local: 0)
```
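
Lua's `string.format` has no thousands-separator flag, so the comma grouping in the summary needs a small helper. One possible sketch follows; the helper names `commas` and `summary_lines` are illustrative, and the input numbers are arbitrary.

```lua
-- Insert thousands separators into a non-negative integer.
local function commas(n)
    local s = string.format("%d", n)
    -- Group digits from the right by reversing, tagging every three digits,
    -- and reversing back; then drop a leading comma if one was produced.
    local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
    return (out:gsub("^,", ""))
end

-- Build the two summary lines; cost uses the 6-decimal format from R10.
local function summary_lines(calls, prompt, completion, cost)
    return ("[aish] session usage: %d calls, prompt=%s / completion=%s tokens")
               :format(calls, commas(prompt), commas(completion)),
           ("cost=$%.6f (cloud only; local: 0)"):format(cost)
end

local line1, line2 = summary_lines(3, 1500, 320, 0.000028)
```

The sub-cent cost from R10 stays visible here: `line2` reads `cost=$0.000028 (cloud only; local: 0)`, where a `%.4f` format would have printed `$0.0000`.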

Detail format (R7 — sort key is `(cost desc, model asc, category asc)`
for deterministic ordering on equal-cost rows; R6 — annotation comes
from the slot's `is_local` flag, NOT a `cost == 0` heuristic):

```
[aish] session usage detail:
  cloud     main      8 calls,  3,850 / 980 tokens,   $0.018000
  cloud     probe     1 call,     150 / 30  tokens,   $0.004200
  cloud     delegate  1 call,     250 / 80  tokens,   $0.001200
  fast      main     14 calls,  8,200 / 2,100 tokens, $0       (local)
```

Implementation: pure Lua iteration over `ctx.usage_totals`; no broker
calls. Sort flattens into a list, sorts via `table.sort` with explicit
3-level comparator: `cost desc, model asc, category asc`.
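
A minimal sketch of the flatten-and-sort step, assuming slots shaped like the §5 accumulator. The function name `detail_rows` and the extra `local2` model are illustrative; the second zero-cost model is there only to exercise the tie-break.

```lua
-- Flatten usage_totals into rows, then sort with the three-level key.
local function detail_rows(usage_totals)
    local rows = {}
    for model, cats in pairs(usage_totals) do
        for category, slot in pairs(cats) do
            rows[#rows + 1] = {
                model = model, category = category,
                cost = slot.cost, calls = slot.calls,
                is_local = slot.is_local,
            }
        end
    end
    table.sort(rows, function(a, b)
        if a.cost  ~= b.cost  then return a.cost  > b.cost  end  -- cost desc
        if a.model ~= b.model then return a.model < b.model end  -- model asc
        return a.category < b.category                           -- category asc
    end)
    return rows
end

local rows = detail_rows({
    fast   = { main = { cost = 0, calls = 14, is_local = true } },
    local2 = { main = { cost = 0, calls = 2,  is_local = true } },
    cloud  = {
        main     = { cost = 0.0180, calls = 8 },
        delegate = { cost = 0.0012, calls = 1 },
        probe    = { cost = 0.0042, calls = 1 },
    },
})

local order = {}
for i, r in ipairs(rows) do order[i] = r.model .. "/" .. r.category end
```

Because `pairs` iteration order is unspecified and `table.sort` is not stable, the full three-level comparator is what makes the output reproducible. Each `(model, category)` pair is unique, so the comparator never has to order two equal rows.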

---

## 7. Pillar 4 — Warning thresholds

Config:

```lua
cost = {
    warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
    warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}
```

R5 centralizes the check inside a single `_record_usage(model, cat, u)`
helper in repl.lua. This is the ONLY place that calls
`ctx:add_usage`; safety.lua call sites route through it via the
`helpers.on_usage` / `opts.on_usage` callback. Keeps `context.lua`
decoupled from `renderer` (no module-coupling violation).

R4: two independent flags (one per threshold) — first-to-fire must
NOT suppress the other.

```lua
-- repl.lua (sketch):
local function _record_usage(model, category, u)
    ctx:add_usage(model, category, u)
    if not (config.cost) then return end
    ctx.cost_warn_state = ctx.cost_warn_state or { dollars = false, tokens = false }
    local cw = ctx.cost_warn_state
    if config.cost.warn_at_dollars and not cw.dollars then
        local cost = ctx:total_cost()
        if cost >= config.cost.warn_at_dollars then
            -- R10: 6-decimal format for sub-cent visibility
            renderer.status(("session cost $%.6f has crossed warn_at_dollars=$%.6f")
                            :format(cost, config.cost.warn_at_dollars))
            cw.dollars = true
        end
    end
    if config.cost.warn_at_tokens and not cw.tokens then
        local p, c = ctx:total_tokens()
        if (p + c) >= config.cost.warn_at_tokens then
            renderer.status(("session tokens %d has crossed warn_at_tokens=%d")
                            :format(p + c, config.cost.warn_at_tokens))
            cw.tokens = true
        end
    end
end
```

One-shot per threshold per session. `:cost reset` clears both
totals AND both warn flags atomically.
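
The one-shot, per-threshold behavior can be checked in isolation. The sketch below mirrors the logic of `_record_usage` but injects the totals, the config block, and the status sink, so it runs without `ctx` or `renderer`; `make_warn_check` is a hypothetical name.

```lua
-- Hypothetical factory: same threshold logic as _record_usage, with the
-- config and the status sink injected so the sketch runs standalone.
local function make_warn_check(cost_cfg, status)
    local state = { dollars = false, tokens = false }
    return function(total_cost, total_tokens)
        if cost_cfg.warn_at_dollars and not state.dollars
           and total_cost >= cost_cfg.warn_at_dollars then
            status(("session cost $%.6f has crossed warn_at_dollars=$%.6f")
                   :format(total_cost, cost_cfg.warn_at_dollars))
            state.dollars = true
        end
        if cost_cfg.warn_at_tokens and not state.tokens
           and total_tokens >= cost_cfg.warn_at_tokens then
            status(("session tokens %d has crossed warn_at_tokens=%d")
                   :format(total_tokens, cost_cfg.warn_at_tokens))
            state.tokens = true
        end
    end
end

local messages = {}
local check = make_warn_check(
    { warn_at_dollars = 0.50, warn_at_tokens = 1000 },
    function(msg) messages[#messages + 1] = msg end)

check(0.10, 400)    -- below both thresholds: silent
check(0.20, 1200)   -- tokens cross first: one status
check(0.60, 2500)   -- dollars cross later: still fires (R4)
check(0.90, 9000)   -- both already fired: silent
```

Four calls produce exactly two status lines, one per threshold, which is the behavior a single shared flag would have broken.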

---

## 8. UX Surface Summary

| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear both warn flags |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

R9 boundary note: `:resume <name>` reloads turns for conversation
continuity but does NOT reconstruct `ctx.usage_totals` from the
per-turn `usage` fields stored in the session JSONL. After `:resume`,
the cost meter starts fresh from zero for the resumed session's live
calls. The historical usage IS in the JSONL for after-the-fact
scripting; cross-session aggregation is Q-C2 deferred work.
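
For the after-the-fact scripting mentioned above, a rough sketch of the aggregation step follows. It assumes each JSONL line has already been decoded into a Lua table (for example with dkjson) and that assistant entries carry `role = "assistant"` plus the optional `usage` field; the exact record layout is an assumption, not something this manifest specifies.

```lua
-- `turns` is a list of already-decoded session-log entries. Entries without
-- a usage field (user turns, local calls that emitted none) are skipped.
local function sum_logged_usage(turns)
    local totals = { prompt = 0, completion = 0, cost = 0, calls = 0 }
    for _, turn in ipairs(turns) do
        local u = turn.role == "assistant" and turn.usage
        if u then
            totals.prompt     = totals.prompt     + (u.prompt_tokens or 0)
            totals.completion = totals.completion + (u.completion_tokens or 0)
            totals.cost       = totals.cost       + (u.cost or 0)
            totals.calls      = totals.calls      + 1
        end
    end
    return totals
end

local totals = sum_logged_usage({
    { role = "user",      content = "hi" },
    { role = "assistant", content = "hello",
      usage = { prompt_tokens = 100, completion_tokens = 20, cost = 0.25 } },
    { role = "assistant", content = "local reply",
      usage = { prompt_tokens = 50, completion_tokens = 10 } },
    { role = "assistant", content = "no usage recorded" },
})
```

Unlike the live accumulator, this sketch folds nil costs into zero and therefore loses the R6 local-versus-cloud distinction; a fuller script would track that separately.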

---

## 9. Out of Scope (Phase 7)

- **Cross-session cost persistence** — Q-C2 defers `<history.dir>/cost.jsonl`
  rollup; v1 is session-only. Per-turn usage IS in the session JSONL for
  after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call** — v1 only
  warns. A future phase could add a hard cap that aborts before the
  broker call.
- **Pricing-table fallback for local models** — if a local model doesn't
  emit `usage.cost`, the slot's cost stays at 0 and the slot is flagged
  `is_local` (R6). Estimating cost from token count + a static pricing
  table is a future polish (most users won't care about local "cost"
  anyway, since local is free).
- **Pretty token-bandwidth charts / sparklines** — out of scope; the
  detail breakdown is text-only.
- **Estimated cost for future turns** — no preflight cost prediction.
- **MCP tool-call usage** — MCP servers don't expose token usage;
  broker calls invoked DURING MCP tool dispatch ARE captured (because
  they go through the same path), but the MCP tool call itself isn't.

---

## 10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> displayed `$0` could mislead user "is this REALLY free?" | `:cost detail` annotates local lines with `(local)` literal; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by `#models in config` × `#categories` — small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |
| R8: `call_broker` fallback retry passes `opts.include_usage` unchanged | Documented assumption: B1 confirmed both backends accept the flag. If a future fallback host rejects, the call-site that knows can pass `opts.include_usage = false` explicitly. |

---

## 11. Open Questions (Phase 7)

| # | Question | Resolution |
|---|---|---|
| Q-C1 | Provider-without-usage handling | A10 — defensive silent skip; baseline probe will confirm shape on local llama.cpp. |
| Q-C2 | Cross-session cost persistence (`cost.jsonl`) | Deferred to follow-up phase 8; v1 is session-only. |
| Q-C3 | Categories closed-set vs free-form | A4 — **free-form**; caller decides. Matches Phase 6 helpers/skills convention. |
| Q-C4 | `stream_options` forwarding by hossenfelder | B1 RESOLVED — both backends accept; flag is REQUIRED for local llama.cpp, no-op for cloud. Default-true is correct. |
| Q-C5 | Warn fires on the crossed call or the next | A5 — **on the crossed call** (no UX-defeating delay). |
| Q-C6 | `:reset` clears `cost_warn_fired` | A6 — **no**, only `:cost reset` clears the flag (R8 parity). |

---

## 12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with
  optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call** — an extension
  of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4
  heuristic on `Context:estimate_tokens()` with model `/tokenize` calls.
  Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained — no upstream dependencies.

---

## 13. Implementation Plan (commit-by-commit)

Bottom-up; broker first (it's the egress point that all callers
depend on), then context (the accumulator), then the call-site
rewires, then the user-facing meta + warn surface, then config +
status bump. Each commit leaves the tree green (existing tests +
load smoke + per-commit feature smoke).

### Order

1. **`broker.lua` — usage capture + signature widening.**
   - `build_request(model_cfg, messages, stream, opts)` widened to
     take an opts table; opts.tools / opts.max_tokens fold in from
     the existing positional args.
   - **R3: TWO internal callers of `build_request` exist inside
     broker.lua itself** (`M.chat_stream` at lines 65-66 and indirectly
     via `M.chat`). Both must be updated in this commit; the
     migration is CONTAINED but not zero-touch. "Every caller already
     passes opts" refers to the public surface — internal `build_request`
     was positional.
   - `opts.include_usage` (default true) adds `stream_options.include_usage
     = true` to the request body (per B1, required for local).
   - `M.chat_stream` event loop adds `if doc.usage then final_usage =
     doc.usage end`; after `curl.post_sse` returns, if `final_usage`
     is set, `on_delta("usage", payload)` is called. Payload includes
     `model = model_cfg.model` (caller-stable per B4 + R2), the raw
     token counts, and `cost` as a number (nil for local per B3).
   - opts.category passthrough — the broker just echoes it into the
     emitted usage payload; doesn't validate (per A4 free-form).
   - **R1: `M.chat` (non-streaming wrapper) MUST capture usage in its
     internal on_delta and return `(text, usage)`. Without this, the
     four categories that go through `broker.chat` (summarize / delegate /
     memory_summarize / probe) silently report zero.** §4 shows the
     explicit update.
   - Smoke: hand-build a request with stream_options, capture all
     three on_delta kinds (text, tool_call when applicable, usage),
     confirm usage payload matches what curl shows. Also smoke
     `broker.chat(...)` returns non-nil usage for cloud calls.

2. **`context.lua` — accumulator + helpers.**
   - `Context.new`: `self.usage_totals = {}` + `self.cost_warn_fired = false`.
   - `Context:add_usage(model, category, usage)` — increments
     `usage_totals[model][category]` slots.
   - `Context:total_cost()` — sums all cost fields across all models/categories.
   - `Context:total_tokens()` — sums prompt + completion separately.
   - `Context:reset` — does NOT touch `usage_totals` or `cost_warn_fired`
     (R8 parity with `memory_items` and `project`).
   - Smoke: 4-case inline test of add_usage / totals / reset preservation.

3. **`repl.lua` — wire opts.category + on_delta("usage") at non-Norris call sites.**
   **N3: depends on commit 1's R1 M.chat fix shipping first.** This
   commit's "capture the second return value" pattern only works
   after M.chat actually returns one.
   - `_record_usage(model, category, usage)` helper (R5) — the single
     chokepoint that wraps `ctx:add_usage` AND does the warn check.
     Replaces all direct `ctx:add_usage(...)` invocations in repl.lua.
   - call_broker wrapper (used by ask_ai): pass `opts.category =
     "main"`; the wrapped on_delta handles `kind == "usage"` by
     calling `_record_usage(payload.model, payload.category, payload)`
     — keys by **payload.model** per R2 (handles fallback retry
     correctly without tracking primary-vs-fallback at the wrapper).
   - DELEGATE: handler: opts.category = "delegate"; capture second
     return value from broker.chat and feed to `_record_usage`.
   - :delegate meta: opts.category = "delegate"; same.
   - summarize-on-evict callback: opts.category = "summarize"; same.
   - :memory summarize: opts.category = "memory_summarize"; same.
   - Smoke: send one cloud prompt, observe ctx.usage_totals grows;
     also smoke the fallback path with a deliberately-broken primary
     and confirm usage credits the fallback model name (R2 verification).

4. **`safety.lua` — opts.category for Norris + probe.**
   - safety.norris_step's broker.chat_stream call: pass `opts.category
     = "norris"`. The on_delta wrapper inside safety.lua was already
     widened (post-#52) to handle `kind == "text"` (rehydration);
     now also handles `kind == "usage"` by calling
     `helpers.on_usage(payload.model, payload.category, payload)`.
     R5: helpers.on_usage IS repl.lua's `_record_usage`.
   - **N4 signature chain widening**: `llm_probe`, `llm_second_opinion`,
     and `M.is_destructive` all widen to thread `opts.on_usage` through:
       - `llm_probe(model_cfg, system, cmd, opts)` — pass `opts.category
         = "probe"` to broker.chat; on the `(text, usage)` return,
         if `opts.on_usage` AND `usage`, call `opts.on_usage(usage.model,
         usage.category, usage)`.
       - `llm_second_opinion(cmd, cfg, opts)` — pass opts through to
         both llm_probe calls (probe 1 + probe 2 re-roll).
       - `M.is_destructive(cmd, cfg, opts)` — opts.on_usage already in
         the table from #52's scrub_msgs/rehydrate addition; threads
         through naturally.
   - Smoke: a Norris session shows both "norris" and "probe" category
     entries in :cost detail; the probe model is named correctly
     (e.g. "cloud" if safety.llm_model = "cloud").

5. **`repl.lua` — :cost meta + warn-threshold + HELP.**
   - :cost (summary), :cost detail (per-model+category breakdown),
     :cost reset (zero totals + clear cost_warn_fired).
   - In `_record_usage` (the commit-3 chokepoint, so every
     `ctx:add_usage` call is covered), check cfg.cost.warn_at_dollars /
     warn_at_tokens; emit a one-shot status if a threshold is crossed
     AND cost_warn_fired is false.
   - HELP gains 3 lines for :cost.
   - Smoke: :cost shows totals; :cost detail breaks down; warn fires
     once when threshold crossed; :cost reset re-arms.

6. **`config.lua` example block + `docs/PHASE7.md` status bump.**
   - Commented-out `cost = { warn_at_dollars = 0.50, warn_at_tokens
     = 100000 }` block in config.lua.
   - **N5: PHASE0.md §11 amendment is already in tree** (committed
     at `3bad07b` with the formulate doc). Commit 6 must NOT re-apply.
   - PHASE7.md status header → **Implement** (matches Phase 5/6
     cadence — manifest tracks implementation state).
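
The accumulator from commit 2 and the `payload.model` keying from commit 3 can be sketched together. This is an illustration under assumptions, not the shipped context.lua: the counter field names (`prompt_tokens`, `completion_tokens`), the `messages` field, and the `"uncategorized"` fallback label are all hypothetical.

```lua
local Context = {}
Context.__index = Context

function Context.new()
  return setmetatable({
    messages = {},          -- stand-in for conversation state (assumption)
    usage_totals = {},      -- model -> category -> counters
    cost_warn_fired = false,
  }, Context)
end

function Context:add_usage(model, category, usage)
  if not usage then return end                 -- provider sent no usage: skip
  self.usage_totals = self.usage_totals or {}  -- defensive for older ctx tables
  category = category or "uncategorized"       -- assumption: fallback label
  local cats = self.usage_totals[model] or {}
  self.usage_totals[model] = cats
  local slot = cats[category]
    or { prompt_tokens = 0, completion_tokens = 0, cost = 0 }
  cats[category] = slot
  slot.prompt_tokens     = slot.prompt_tokens + (usage.prompt_tokens or 0)
  slot.completion_tokens = slot.completion_tokens + (usage.completion_tokens or 0)
  slot.cost              = slot.cost + (usage.cost or 0)  -- nil cost counts as 0
end

function Context:total_cost()
  local sum = 0
  for _, cats in pairs(self.usage_totals) do
    for _, slot in pairs(cats) do sum = sum + slot.cost end
  end
  return sum
end

function Context:total_tokens()
  local prompt, completion = 0, 0
  for _, cats in pairs(self.usage_totals) do
    for _, slot in pairs(cats) do
      prompt = prompt + slot.prompt_tokens
      completion = completion + slot.completion_tokens
    end
  end
  return prompt, completion
end

function Context:reset()
  self.messages = {}  -- usage_totals and cost_warn_fired are left alone
end

local ctx = Context.new()

-- Wrapped on_delta: keys by payload.model, never by the caller's primary name.
local function on_delta(kind, payload)
  if kind == "usage" then
    ctx:add_usage(payload.model, payload.category, payload)
  end
end

on_delta("usage", { model = "cloud", category = "main",
                    prompt_tokens = 100, completion_tokens = 20, cost = 0.25 })
-- Simulated fallback retry: the payload names the model that actually answered.
on_delta("usage", { model = "local", category = "main",
                    prompt_tokens = 50, completion_tokens = 10 })
ctx:reset()
```

Because the wrapper never consults an outer model name, the fallback's tokens land under `"local"` without the wrapper tracking which attempt produced them.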

### Risk index per commit

| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | R3: build_request has TWO INTERNAL callers in broker.lua; both must be updated in this commit | Explicit in commit-1 note above; grep `build_request\(` to confirm |
| 1 (broker) | R1: M.chat must capture usage in on_delta and return (text, usage) | §4 shows the explicit M.chat update; smoke test verifies non-nil usage on cloud call |
| 1 (broker) | `M.chat` second return value confuses callers that do `local r = broker.chat(...)` discarding the second | Lua doesn't error on dropped return values; backward-compat preserved automatically |
| 2 (context) | usage_totals nil on old ctx serializations | Defensive `self.usage_totals = self.usage_totals or {}` in add_usage; no migration needed |
| 3 (repl wires) | Forgetting one call site = silent under-count | Lint by grep for `broker.chat\(` and `broker.chat_stream\(` after the wire commit; ensure each is tagged with opts.category |
| 3 (repl wires) | R2: fallback retry credits usage to wrong model | wrapped on_delta keys by `payload.model` (set inside broker per R2), NOT by outer `model_name`; smoke a deliberately-broken-primary case |
| 4 (safety wires) | safety.lua must NOT introduce new module dep | Use helpers.on_usage callback convention (matches #52's scrub_msgs) |
| 4 (safety wires) | N4: llm_probe → llm_second_opinion → is_destructive signature chain widening | Spelled out in commit-4 note above |
| 5 (:cost + warn) | warn fires multiple times when threshold is much exceeded by one call | per-threshold one-shot flag in `ctx.cost_warn_state`; explicit :cost reset to re-arm both |
| 5 (:cost + warn) | R4: single shared flag covers two thresholds | RESOLVED — split into `cost_warn_state.dollars` + `.tokens` |
| 6 (config + status) | N5: PHASE0 §11 already amended at `3bad07b` | This commit does NOT re-apply the amendment |
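
The per-threshold one-shot behaviour in the commit-5 rows might look like the sketch below. `cost_warn_state.dollars` / `.tokens` and the `cfg.cost.*` keys come from this plan; the function name `check_warn`, the `>=` comparison, and the message text are assumptions for illustration.

```lua
-- Returns a list of warning strings fired by this call (possibly empty).
local function check_warn(state, cfg, total_cost, total_tokens)
  local fired = {}
  local cost_cfg = cfg.cost or {}
  if cost_cfg.warn_at_dollars and not state.dollars
     and total_cost >= cost_cfg.warn_at_dollars then
    state.dollars = true  -- one-shot: stays set until :cost reset
    fired[#fired + 1] = string.format("cost $%.2f reached warn level $%.2f",
                                      total_cost, cost_cfg.warn_at_dollars)
  end
  if cost_cfg.warn_at_tokens and not state.tokens
     and total_tokens >= cost_cfg.warn_at_tokens then
    state.tokens = true
    fired[#fired + 1] = string.format("tokens %d reached warn level %d",
                                      total_tokens, cost_cfg.warn_at_tokens)
  end
  return fired
end

local cfg   = { cost = { warn_at_dollars = 0.50, warn_at_tokens = 100000 } }
local state = { dollars = false, tokens = false }

local first  = check_warn(state, cfg, 0.75, 2000)    -- dollars threshold crossed
local second = check_warn(state, cfg, 5.00, 4000)    -- already fired: silent
local third  = check_warn(state, cfg, 5.10, 120000)  -- tokens fires independently

state.dollars, state.tokens = false, false           -- what :cost reset would do
local fourth = check_warn(state, cfg, 5.10, 120000)  -- both re-armed, both fire
```

Keeping one flag per threshold means a session that blows far past the dollar limit in a single call still warns exactly once, while the token limit keeps its own independent trigger.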

### Tests + smoke per commit

Each commit:
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
- Pass a per-feature smoke (described under each commit above)
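
As an illustration of a per-feature smoke, the commit-1 check that `broker.chat(...)` returns non-nil usage could be exercised offline with a stub in place of the network call. Everything here is hypothetical scaffolding: the stub, its argument order, the `reports_usage` switch, and the token numbers are invented for the sketch and do not mirror broker.lua's real signatures.

```lua
-- Stub standing in for broker.chat_stream; argument order is an assumption.
local function fake_chat_stream(model_cfg, messages, opts, on_delta)
  on_delta("text", "hel")
  on_delta("text", "lo")
  if model_cfg.reports_usage then  -- hypothetical switch for this smoke only
    on_delta("usage", {
      model = model_cfg.model, category = opts.category,
      prompt_tokens = 12, completion_tokens = 2,
    })
  end
end

-- M.chat-style wrapper: captures usage alongside text (the R1 fix).
local function chat(chat_stream, model_cfg, messages, opts)
  opts = opts or {}
  local parts, usage = {}, nil
  chat_stream(model_cfg, messages, opts, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end)
  return table.concat(parts), usage
end

local text, usage = chat(fake_chat_stream,
  { model = "cloud", reports_usage = true }, {}, { category = "summarize" })
local text2, usage2 = chat(fake_chat_stream,
  { model = "quiet" }, {}, { category = "summarize" })

assert(text == "hello" and usage ~= nil, "usage must reach the caller")
assert(text2 == "hello" and usage2 == nil, "missing usage is a silent skip")
print("ok")
```

The second call covers the provider-without-usage case: the wrapper returns `(text, nil)` and leaves it to the caller to skip recording.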

### Things deliberately NOT split

- broker.chat backward-compat shim — Lua's multiple-return-values
  semantics handle it automatically (existing `local r = broker.chat(...)`
  drops the new `usage` value).
- Per-category sub-tables — flat `model -> category -> counters` is
  simple enough; nesting deeper for e.g. timestamps is v2.
- Cross-session persistence — explicitly Q-C2 deferred to phase 8.
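
The backward-compat claim rests on how Lua adjusts multiple results. A short sketch with a stand-in `chat` function (hypothetical, not the real `broker.chat`) shows the cases worth grepping for:

```lua
-- Stand-in for a two-value broker.chat.
local function chat()
  return "hello", { prompt_tokens = 12, completion_tokens = 2 }
end

local r = chat()                 -- old call site: usage is silently dropped
local text, usage = chat()       -- new call site: both values captured
local packed = { chat() }        -- last position in a list expands to both values
local first_only = { (chat()) }  -- parentheses truncate to exactly one value
```

Plain assignments are safe, as the bullet above says. The one place an old call site can observe the new value is when the call sits last in an argument list or table constructor, so a call such as `print(broker.chat(...))` would start printing the usage table too; wrapping the call in parentheses restores the single-value behaviour.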

### Open at plan-time (resolve at implement)

- Whether `safety.is_destructive`'s opts should carry `on_usage`
  callback explicitly OR thread through cfg.helpers (the latter
  matches the Norris helpers convention but is more coupling).
  Decide at commit 4. Default to explicit opts.on_usage for minimum
  surface.
- Whether to emit a `[aish] usage: model=X prompt=N completion=M cost=$X`
  status line PER TURN (verbose mode) or only via :cost on demand.
  v1 = on demand only; verbose mode is a follow-up nice-to-have.
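
For the first open item, the explicit `opts.on_usage` default might thread through `llm_probe` roughly as follows. This is a sketch under assumptions: the `broker` table is a local stub, its `chat(model_cfg, messages, opts)` argument order and the `"SAFE"` reply are invented, and only the `llm_probe(model_cfg, system, cmd, opts)` shape and the `"probe"` category come from the plan.

```lua
-- Local stub; the real broker.chat signature may differ.
local broker = {}
function broker.chat(model_cfg, messages, opts)
  return "SAFE", {
    model = model_cfg.model, category = opts.category,
    prompt_tokens = 40, completion_tokens = 1,
  }
end

local function llm_probe(model_cfg, system, cmd, opts)
  opts = opts or {}
  local text, usage = broker.chat(model_cfg, {
    { role = "system", content = system },
    { role = "user",   content = cmd },
  }, { category = "probe" })
  -- Only report when the caller asked for it AND the provider sent usage.
  if opts.on_usage and usage then
    opts.on_usage(usage.model, usage.category, usage)
  end
  return text
end

local seen = {}
local opts = {
  on_usage = function(model, category, usage)
    seen[#seen + 1] = { model = model, category = category, usage = usage }
  end,
}

local verdict = llm_probe({ model = "cloud" }, "Is this destructive?", "rm -rf build/", opts)
local quiet   = llm_probe({ model = "cloud" }, "Is this destructive?", "ls")  -- no callback
```

With the callback passed explicitly, safety.lua needs no reference to the context or to repl.lua; a caller that omits `opts.on_usage` simply gets no accounting.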