aish/docs/PHASE7.md
marfrit 3bad07b2da docs/PHASE7: formulate — cost / usage observability
Phase 7 formulate manifest + PHASE0 §11 amendment to add the Phase 7
row (substrate amendment per CLAUDE.md §3, lands in the same commit).

Four pillars:

  1. Usage capture in broker.chat_stream — extract `usage` from the
     final SSE chunk (OpenAI streaming spec with `stream_options:
     {include_usage: true}`). Surface via new on_delta("usage",
     payload) kind. broker.chat returns (text, usage) — backward-
     compat: existing callers ignore the second value.

  2. Per-session accumulator on ctx — ctx.usage_totals[model][category]
     tables (categories: main / delegate / summarize / memory_summarize
     / probe / norris, tagged at the call site via opts.category).
     :reset preserves usage_totals (R8 parity with memory_items /
     project). Session JSONL gains an optional `usage` field on
     assistant turns for after-the-fact analysis.

  3. :cost meta surface — :cost (summary), :cost detail (per-model +
     per-category breakdown), :cost reset (zero the meter). Pure-Lua
     read of ctx.usage_totals; no broker calls.

  4. Optional warn thresholds — cfg.cost.warn_at_dollars /
     warn_at_tokens emit a one-shot status when crossed. Default off;
     useful with cloud presets configured.

Doc covers scope + done-when criteria, tech decisions table, module
changes, per-pillar deep dive with code sketches, UX surface, out of
scope, risks, 6 open questions to resolve in analyze.

Open at formulate:
  Q-C1 — provider-without-usage handling (local llama.cpp probably)
  Q-C2 — cross-session persistence (defer to phase 8)
  Q-C3 — categories closed-set vs free-form
  Q-C4 — does hossenfelder forward stream_options to all backends?
  Q-C5 — warn fires on the call that crosses, or the next one?
  Q-C6 — :reset clears cost_warn_fired too, or only :cost reset?

Scope confirmed via AskUserQuestion: cost/usage observability
(chosen over project-local config overlay and session search/tag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:47:58 +00:00

# aish — Phase 7 Manifest
**Project:** aish — AI-augmented conversational shell
**Document:** Phase 7 Requirements, Architecture & Design Decisions
**Status:** Formulate (pre-analyze)
**Date:** 2026-05-16

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest
specifies what Phase 7 adds — **cost / usage observability**: the ability
to know, mid-session, how many tokens you've spent and how much money the
paid-cloud calls have cost.
PHASE0 §11 originally listed phases only through 6; this commit amends
§11 to add Phase 7.

---
## 1. Scope of Phase 7
Four pillars:
1. **Usage capture in broker**: `broker.chat_stream` extracts the
provider's `usage` block (and `cost` where present) from the response
stream. Surfaces it to the caller via a new `on_delta("usage", ...)`
kind. The existing `broker.chat` buffering wrapper exposes it as a
second return value `(text, usage)`. Backward-compatible: callers
that don't handle the new kind / second value simply ignore it.
2. **Per-session accumulator on `ctx`** — running totals per-model AND
per-call-category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on
`ctx.usage_totals`. No persistence across sessions in v1 (Q-C2
defers cross-session); the session-log JSONL files DO carry per-turn
usage so historical analysis is possible after the fact.
3. **`:cost` meta** — a `:cost` reporter that shows the current session
totals, with optional `:cost detail` for the per-model + per-category
breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).
4. **Optional warning thresholds**: `cfg.cost.warn_at_dollars` and
`cfg.cost.warn_at_tokens` emit a status the first time the running
total crosses the configured threshold. Default off (no warnings
without config). Useful when cloud presets are configured and you
want a "you've spent $1 this session" nudge before runaway cost.
**Phase 7 is done when:**
- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)`
callback kind; `broker.chat` returns `(text, usage)`. Backward compat
preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a
total like:
```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
       cost=$0.0234 (cloud only; local: 0)
```
- `:cost detail` breaks down by model + category:
```
fast main: 14 calls, 8200/2100 tokens
cloud main: 8 calls, 3850/980 tokens, $0.0180
cloud delegate: 1 call, 250/80 tokens, $0.0012
cloud probe: 1 call, 150/30 tokens, $0.0042
```
- Session JSONL gains a `usage` field on assistant turns (when the
broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative
emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6
(Phase 6 regression coverage).
---
## 2. Technology Decisions (delta from Phase 6)
| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, reading the `usage` field of the final SSE chunk | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified in baseline. |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, category?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; the baseline probe must confirm that the current broker tolerates it (see Q-C4). |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One-tag-per-call-site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field — record 0. | Single field name across all observed providers (per baseline B7 — to be confirmed). |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits — plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on ctx; reset only on session end / `:cost reset`. | One-shot to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`) — the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |
---
## 3. Module Changes
| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost` and `:cost detail` / `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe-cost in the breakdown — useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes turn opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless filter desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | **Amendment**: add Phase 7 row to §11. |
No new module files.

---
## 4. Pillar 1 — Usage capture in broker
### SSE shape (provider-by-provider — confirm in baseline)
For OpenAI-compatible streams with `stream_options: { include_usage: true }`:
```json
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```
The final usage event arrives AFTER `finish_reason` but BEFORE `[DONE]`.
`choices` is empty `[]` on the usage event.
A non-streaming response would carry `usage` at the top level of the
response body, but `broker.chat` is a wrapper around `chat_stream`, so it
receives usage through the same `on_delta` path.
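
A minimal sketch of how the buffering wrapper could sit on top of the streaming path, assuming `chat_stream` calls `on_delta(kind, payload)` as described in this manifest. `fake_chat_stream` and the extra `stream_fn` parameter are hypothetical stand-ins so the example runs on its own; they are not part of the real broker. One Lua caveat for backward compatibility: a call in the last position of an argument list expands all return values, so an existing `print(broker.chat(...))` would also receive the usage table unless the call is wrapped in parentheses.

```lua
-- Hypothetical stand-in for broker.chat_stream: replays a fixed stream.
local function fake_chat_stream(cfg, msgs, on_delta, opts)
  on_delta("text", "Hel")
  on_delta("text", "lo")
  on_delta("usage", { prompt_tokens = 15, completion_tokens = 3, total_tokens = 18 })
end

-- Buffering wrapper: concatenates text deltas and remembers the usage payload.
-- `stream_fn` is injected here only to keep the sketch self-contained.
local function chat(cfg, msgs, opts, stream_fn)
  local parts, usage = {}, nil
  stream_fn(cfg, msgs, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
    -- Any other kind is ignored, which is what keeps older callers working.
  end, opts)
  return table.concat(parts), usage
end

-- New-style caller receives both values.
local text, usage = chat({}, {}, {}, fake_chat_stream)

-- Old-style caller binds only the first return value; the second is dropped.
local legacy_text = chat({}, {}, {}, fake_chat_stream)
```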
For local llama.cpp via hossenfelder: usage may or may not be present
depending on the proxy's version. Treat absence as zero-cost / unknown.
### Extraction algorithm
```lua
local final_usage = nil
local function on_event(data)
  -- ... (elided; `doc` is the decoded event) ...
  -- The table check skips `usage: null` on earlier chunks in case the
  -- JSON decoder maps null to a non-nil sentinel.
  if type(doc.usage) == "table" then
    -- Provider sent usage; capture for emission after the stream.
    final_usage = {
      prompt_tokens     = doc.usage.prompt_tokens or 0,
      completion_tokens = doc.usage.completion_tokens or 0,
      total_tokens      = doc.usage.total_tokens or 0,
      cost              = doc.usage.cost,  -- nil for local
      model             = doc.model or model_cfg.model,
      -- Echo the caller's tag so the accumulator knows what to credit.
      category          = (opts and opts.category) or "main",
    }
    -- Don't emit yet: the [DONE] event marks stream end; emit
    -- once we exit the curl.post_sse loop so the caller sees
    -- usage as the LAST event in the stream order.
  end
  -- ... existing text + tool_call handling ...
end
-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
```
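
The ordering guarantee (usage is delivered last, after all text) can be checked without a network call. The sketch below is a hedged, self-contained replay of the SSE shape shown earlier, assuming the events have already been JSON-decoded into Lua tables. `replay` and `fallback_model` are hypothetical names used only for this illustration, and the `category` field follows the behavior described under "Category tagging".

```lua
-- Walks already-decoded SSE events and emits usage as the final delta.
-- `events` is a list of decoded JSON documents (plain Lua tables).
local function replay(events, on_delta, opts, fallback_model)
  opts = opts or {}
  local final_usage = nil
  for _, doc in ipairs(events) do
    if type(doc.usage) == "table" then
      final_usage = {
        prompt_tokens     = doc.usage.prompt_tokens or 0,
        completion_tokens = doc.usage.completion_tokens or 0,
        total_tokens      = doc.usage.total_tokens or 0,
        cost              = doc.usage.cost,  -- stays nil when absent
        model             = doc.model or fallback_model,
        category          = opts.category or "main",
      }
    end
    -- The usage-only event has an empty `choices` list, so this is a no-op there.
    local choice = doc.choices and doc.choices[1]
    local content = choice and choice.delta and choice.delta.content
    if content then on_delta("text", content) end
  end
  -- Stream finished: deliver usage after every text delta.
  if final_usage then on_delta("usage", final_usage) end
end

local seen = {}
replay({
  { choices = { { index = 0, delta = { content = "Hi" } } } },
  { choices = { { index = 0, delta = {}, finish_reason = "stop" } } },
  { choices = {}, usage = { prompt_tokens = 15, completion_tokens = 3,
                            total_tokens = 18, cost = 0.00004 } },
}, function(kind, payload)
  seen[#seen + 1] = { kind = kind, payload = payload }
end, { category = "probe" }, "cloud")
```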
### Outbound include_usage
```lua
local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
  body_table.stream_options = { include_usage = true }
end
```
Risk: some providers reject unrecognized fields. Baseline check; if any
host throws on `stream_options`, opting out is one line
(`opts.include_usage = false` at the affected call site).
### Category tagging
`opts.category` is a string set by the caller. broker echoes it into the
emitted usage payload so the accumulator knows what to credit. Default
category if absent: `"main"`.
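
Q-C3 leaves open whether categories are free-form or a closed set. As one possible shape for the closed-set option, the sketch below normalizes a caller-supplied tag before it is used as an accumulator key. `normalize_category` and the `"other"` bucket are assumptions made for illustration; the set itself is copied from the Categories row of the decisions table.

```lua
-- Closed set of categories, as listed in the technology decisions table.
local KNOWN_CATEGORIES = {
  main = true, delegate = true, summarize = true,
  memory_summarize = true, probe = true, norris = true,
}

-- Hypothetical helper: returns a category that is safe to use as a table key.
local function normalize_category(category)
  if category == nil then return "main" end          -- documented default
  if KNOWN_CATEGORIES[category] then return category end
  return "other"  -- assumption: unknown tags are bucketed rather than rejected
end
```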

---
## 5. Pillar 2 — Accumulator on ctx
### Shape
```lua
ctx.usage_totals = {
  -- [model_name] = { [category] = { prompt = N, completion = N,
  --                                 calls = N, cost = N } }
  fast = {
    main     = { prompt = 8200, completion = 2100, calls = 14, cost = 0 },
  },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
ctx.cost_warn_fired = false
```
### add_usage
```lua
function Context:add_usage(model, category, u)
  model    = model or "?"
  category = category or "main"
  self.usage_totals = self.usage_totals or {}
  local m = self.usage_totals[model] or {}
  local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
  c.prompt     = c.prompt + (u.prompt_tokens or 0)
  c.completion = c.completion + (u.completion_tokens or 0)
  c.calls      = c.calls + 1
  c.cost       = c.cost + (u.cost or 0)
  m[category] = c
  self.usage_totals[model] = m
end

function Context:total_cost()
  local total = 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do total = total + c.cost end
  end
  return total
end

function Context:total_tokens()
  local p, comp = 0, 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do
      p    = p + c.prompt
      comp = comp + c.completion
    end
  end
  return p, comp
end
```
### Reset semantics
`Context:reset()` deliberately does NOT clear `usage_totals` —
matches R8 invariant from Phase 6 (`:reset` clears `turns`,
`pending_exec_output`, `summary`; preserves `memory_items`, `project`,
and now `usage_totals`). The user reset their conversation, not their
cost meter. `:cost reset` is the explicit reset verb for the meter.
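
The split between the two reset verbs can be sketched as follows. This is an illustration under assumptions, not the real `context.lua`: `Context.new` and the method name `reset_usage` are hypothetical, and the field list is taken from the R8 description above. Note that this sketch leaves `cost_warn_fired` untouched on `:reset`, which is only one of the two answers to Q-C6.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor holding just the fields named in this section.
function Context.new()
  return setmetatable({
    turns = {}, summary = nil, pending_exec_output = nil,
    memory_items = {}, project = nil,
    usage_totals = {}, cost_warn_fired = false,
  }, Context)
end

-- :reset clears conversation state only; the cost meter survives.
function Context:reset()
  self.turns = {}
  self.summary = nil
  self.pending_exec_output = nil
end

-- :cost reset zeroes the meter and re-arms the one-shot warning.
function Context:reset_usage()
  self.usage_totals = {}
  self.cost_warn_fired = false
end

local ctx = Context.new()
ctx.turns[1] = { role = "user", content = "hi" }
ctx.usage_totals.cloud = {
  main = { prompt = 10, completion = 2, calls = 1, cost = 0.001 },
}
ctx.cost_warn_fired = true

ctx:reset()
local kept_after_reset = ctx.usage_totals.cloud ~= nil
local flag_after_reset = ctx.cost_warn_fired

ctx:reset_usage()
```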

---
## 6. Pillar 3 — `:cost` meta
```
:cost          summary line
:cost detail   per-model + per-category breakdown
:cost reset    zero out ctx.usage_totals + cost_warn_fired
```
Summary format:
```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
       cost=$0.0234 (cloud only; local: 0)
```
Detail format (sorted by total cost desc, then by model):
```
[aish] session usage detail:
  cloud  main      8 calls,  3,850 / 980 tokens,   $0.0180
  cloud  probe     1 call,   150 / 30 tokens,      $0.0042
  cloud  delegate  1 call,   250 / 80 tokens,      $0.0012
  fast   main      14 calls, 8,200 / 2,100 tokens, $0 (local)
```
Implementation: pure Lua iteration over `ctx.usage_totals`; no broker
calls. Sorting uses `table.sort` on a flattened list.
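
A hedged sketch of the flatten-then-sort step for `:cost detail`, assuming the `ctx.usage_totals` shape from section 5. The helper names (`commas`, `detail_rows`, `format_row`) are hypothetical, the column widths are arbitrary, and treating a zero cost as `(local)` is an assumption that would mislabel a cloud call reported at $0. Ties on cost fall back to model and then category so the output order is deterministic.

```lua
-- Inserts thousands separators, e.g. 12450 -> "12,450".
local function commas(n)
  local s = tostring(math.floor(n))
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))
end

-- Flattens usage_totals into rows sorted by cost desc, then model, then category.
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = { model = model, category = category, c = c }
    end
  end
  table.sort(rows, function(a, b)
    if a.c.cost ~= b.c.cost then return a.c.cost > b.c.cost end
    if a.model ~= b.model then return a.model < b.model end
    return a.category < b.category
  end)
  return rows
end

-- Renders one row in roughly the detail format shown above.
local function format_row(r)
  local cost = r.c.cost > 0 and ("$%.4f"):format(r.c.cost) or "$0 (local)"
  return ("%-6s %-9s %d %s, %s / %s tokens, %s"):format(
    r.model, r.category, r.c.calls, r.c.calls == 1 and "call" or "calls",
    commas(r.c.prompt), commas(r.c.completion), cost)
end

local rows = detail_rows({
  fast  = { main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
})
```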

---
## 7. Pillar 4 — Warning thresholds
Config:
```lua
cost = {
  warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses this amount
  warn_at_tokens  = 100000,  -- emit once when cumulative tokens cross this count
}
```
After every `ctx:add_usage`, check:
```lua
if config.cost and not ctx.cost_warn_fired then
  local cost = ctx:total_cost()
  if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
    renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, config.cost.warn_at_dollars))
    ctx.cost_warn_fired = true
  end
  -- (similar for warn_at_tokens; share the flag or use two)
end
```
One-shot per session. `:cost reset` clears the flag.
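
The check above leaves the token threshold as "share the flag or use two". The sketch below illustrates the two-flag variant as a pure function, so each threshold fires once independently of the other. `check_thresholds`, the `fired` table, and the token message wording are assumptions for illustration; the manifest itself only specifies the dollar message.

```lua
-- Returns the warning messages for thresholds crossed for the first time.
-- `fired` is a per-session table holding one flag per threshold.
local function check_thresholds(cost_cfg, total_cost, total_tokens, fired)
  local msgs = {}
  if not cost_cfg then return msgs end  -- no cfg.cost: Phase 6 behavior
  local d, t = cost_cfg.warn_at_dollars, cost_cfg.warn_at_tokens
  if d and not fired.dollars and total_cost >= d then
    fired.dollars = true
    msgs[#msgs + 1] =
      ("session cost $%.4f has crossed warn_at_dollars=$%.4f"):format(total_cost, d)
  end
  if t and not fired.tokens and total_tokens >= t then
    fired.tokens = true
    msgs[#msgs + 1] =
      ("session tokens %d have crossed warn_at_tokens=%d"):format(total_tokens, t)
  end
  return msgs
end

local fired = {}
local cost_cfg = { warn_at_dollars = 0.50, warn_at_tokens = 100000 }
local first  = check_thresholds(cost_cfg, 0.49, 1000, fired)  -- below threshold
local second = check_thresholds(cost_cfg, 0.51, 2000, fired)  -- crossing call
local third  = check_thresholds(cost_cfg, 0.75, 3000, fired)  -- already fired
```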

---
## 8. UX Surface Summary
| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first cross this count |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |
---
## 9. Out of Scope (Phase 7)
- **Cross-session cost persistence** — Q-C2 defers `<history.dir>/cost.jsonl`
rollup; v1 is session-only. Per-turn usage IS in the session JSONL for
after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call** — v1 only
warns. A future phase could add a hard cap that aborts before the
broker call.
- **Pricing-table fallback for local models** — if a local model doesn't
emit `usage.cost`, we record 0. Estimating cost from token count + a
static pricing table is a future polish (most users won't care about
local "cost" anyway — local is free).
- **Pretty token-bandwidth charts / sparklines** — out of scope; the
detail breakdown is text-only.
- **Estimated cost for future turns** — no preflight cost prediction.
- **MCP tool-call usage** — MCP servers don't expose token usage;
broker calls invoked DURING MCP tool dispatch ARE captured (because
they go through the same path), but the MCP tool call itself isn't.
---
## 10. Risks
| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> displayed `$0` could mislead user "is this REALLY free?" | `:cost detail` annotates local lines with `(local)` literal; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by `#models in config` × `#categories` — small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |
---
## 11. Open Questions (Phase 7)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-C1 | What to do when a provider doesn't emit `usage` (local llama.cpp, some misconfigured proxies)? Record zero / silently skip / estimate via char/4? | Accumulator entries for those models | Analyze (probe what hossenfelder actually returns per model class) |
| Q-C2 | Should `<history.dir>/cost.jsonl` accumulate cross-session totals? `:cost --all-time` reporter? | New file + persistence layer | Defer to follow-up phase (v1 = session only) |
| Q-C3 | Categories vs free-form tags — should `opts.category` be free-form (caller decides) or validated against a closed set? | Predictability of `:cost detail` output | Analyze |
| Q-C4 | Does the hossenfelder broker forward `stream_options` to all backends? Some backends may strip it. | include_usage default | Baseline (real probe) |
| Q-C5 | Should `cfg.cost.warn_at_dollars` trigger on the FIRST broker call that crosses, or on the NEXT one (so user sees the call complete before the warn)? | UX detail | Analyze |
| Q-C6 | When `:reset` is called, should `cost_warn_fired` be cleared too (so the next warn-cross fires again)? Or only clear via `:cost reset`? | Reset surface area | Analyze |
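
For Q-C1, the "estimate via char/4" option could look like the sketch below. It is only one of the three candidate answers and nothing here is decided: `estimate_usage` and the `estimated` marker field are hypothetical, and `#` counts bytes rather than characters, so non-ASCII text would be over-counted.

```lua
-- Hypothetical fallback for providers that send no `usage` block:
-- approximate tokens as ceil(bytes / 4) and mark the record as estimated.
local function estimate_usage(prompt_text, completion_text)
  local p = math.ceil(#prompt_text / 4)
  local c = math.ceil(#completion_text / 4)
  return {
    prompt_tokens     = p,
    completion_tokens = c,
    total_tokens      = p + c,
    estimated         = true,  -- lets :cost detail flag the line as approximate
  }
end

local approx = estimate_usage("abcdefgh", "abcde")
```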
---
## 12. Phase 7 → Phase 8+ Out-of-band
Candidate follow-ups (non-binding):
- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with
optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call** — an extension
of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4
heuristic on `Context:estimate_tokens()` with model `/tokenize` calls.
Indirectly improves accuracy of any future "preflight cost predictor".
Phase 7 itself is self-contained — no upstream dependencies.