Phase 7 formulate manifest + PHASE0 §11 amendment to add the Phase 7
row (substrate amendment per CLAUDE.md §3, lands in the same commit).
Four pillars:
1. Usage capture in broker.chat_stream — extract `usage` from the
final SSE chunk (OpenAI streaming spec with `stream_options:
{include_usage: true}`). Surface via new on_delta("usage",
payload) kind. broker.chat returns (text, usage) — backward-
compat: existing callers ignore the second value.
2. Per-session accumulator on ctx — ctx.usage_totals[model][category]
tables (categories: main / delegate / summarize / memory_summarize
/ probe / norris, tagged at the call site via opts.category).
:reset preserves usage_totals (R8 parity with memory_items /
project). Session JSONL gains an optional `usage` field on
assistant turns for after-the-fact analysis.
3. :cost meta surface — :cost (summary), :cost detail (per-model +
per-category breakdown), :cost reset (zero the meter). Pure-Lua
read of ctx.usage_totals; no broker calls.
4. Optional warn thresholds — cfg.cost.warn_at_dollars /
warn_at_tokens emit a one-shot status when crossed. Default off;
useful with cloud presets configured.
Doc covers scope + done-when criteria, tech decisions table, module
changes, per-pillar deep dive with code sketches, UX surface, out of
scope, risks, 6 open questions to resolve in analyze.
Open at formulate:
Q-C1 — provider-without-usage handling (local llama.cpp probably)
Q-C2 — cross-session persistence (defer to phase 8)
Q-C3 — categories closed-set vs free-form
Q-C4 — does hossenfelder forward stream_options to all backends?
Q-C5 — warn fires on the call that crosses, or the next one?
Q-C6 — :reset clears cost_warn_fired too, or only :cost reset?
Scope confirmed via AskUserQuestion: cost/usage observability
(chosen over project-local config overlay and session search/tag).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
aish — Phase 7 Manifest
Project: aish (AI-augmented conversational shell)
Document: Phase 7 Requirements, Architecture & Design Decisions
Status: Formulate (pre-analyze)
Date: 2026-05-16
PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest specifies what Phase 7 adds — cost / usage observability: the ability to know, mid-session, how many tokens you've spent and how much money the paid-cloud calls have cost.
PHASE0 §11 originally listed phases only through 6; this commit amends §11 to add Phase 7.
1. Scope of Phase 7
Four pillars:
- Usage capture in broker: `broker.chat_stream` extracts the provider's `usage` block (and `cost` where present) from the response stream and surfaces it to the caller via a new `on_delta("usage", ...)` kind. The existing `broker.chat` buffering wrapper exposes it as a second return value, `(text, usage)`. Backward-compatible: callers that don't handle the new kind or the second value simply ignore it.
- Per-session accumulator on `ctx`: running totals per model and per call category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on `ctx.usage_totals`. No persistence across sessions in v1 (Q-C2 defers cross-session totals); the session-log JSONL files do carry per-turn usage, so historical analysis is possible after the fact.
- `:cost` meta: a `:cost` reporter that shows the current session totals, with optional `:cost detail` for the per-model + per-category breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).
- Optional warning thresholds: `cfg.cost.warn_at_dollars` and `cfg.cost.warn_at_tokens` emit a status the first time the running total crosses the configured threshold. Default off (no warnings without config). Useful when cloud presets are configured and you want a "you've spent $1 this session" nudge before runaway cost.
Phase 7 is done when:
- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)` callback kind; `broker.chat` returns `(text, usage)`. Backward compat preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a total like:

      [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
             cost=$0.0234 (cloud only; local: 0)

- `:cost detail` breaks down by model + category:

      fast   main:      14 calls, 8200/2100 tokens
      cloud  main:       8 calls, 3850/980 tokens, $0.0180
      cloud  delegate:   1 call,  250/80 tokens,   $0.0012
      cloud  probe:      1 call,  150/30 tokens,   $0.0042

- Session JSONL gains a `usage` field on assistant turns (when the broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6 (Phase 6 regression coverage).
2. Technology Decisions (delta from Phase 6)
| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, reading the `usage` field carried by the final SSE chunk | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified in baseline. |
| New `on_delta` kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default true) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on, pending the baseline probe confirming that the current broker tolerates it (Q-C4). |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | `ctx` is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One tag per call site, set by the caller (the caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp there is no cost field; record 0. | Single field name across all observed providers (per baseline B7, to be confirmed). |
| Cost precision | Store as `number` (Lua double: 53-bit mantissa, ~15 decimal digits; plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on `ctx`; reset only on session end / `:cost reset`. | One-shot to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items` / `project`): the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when the broker returned one. `history.lua` `log_turn` writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |
3. Module Changes
| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with `memory_items` / `project` per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost`, `:cost detail` and `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main `chat_stream` call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe cost in the breakdown; useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | `session.log_turn` appends JSONL entries | `log_turn` already takes `turn` opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless a filter is desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | Amendment: add Phase 7 row to §11. |

No new module files.
4. Pillar 1 — Usage capture in broker
SSE shape (provider-by-provider — confirm in baseline)
For OpenAI-compatible streams with stream_options: { include_usage: true }:

    data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
    data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
    data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
    data: [DONE]

The final usage event arrives AFTER finish_reason but BEFORE [DONE].
choices is empty [] on the usage event.
For non-streaming chat: usage is in the response body at the top level.
broker.chat is a wrapper around chat_stream, so it inherits the on_delta
path.
For local llama.cpp via hossenfelder: usage may or may not be present depending on the proxy's version. Treat absence as zero-cost / unknown.
Extraction algorithm

    local final_usage = nil
    local function on_event(data)
      ...
      if doc.usage then
        -- Provider sent usage; capture for emission after the stream.
        final_usage = {
          prompt_tokens     = doc.usage.prompt_tokens or 0,
          completion_tokens = doc.usage.completion_tokens or 0,
          total_tokens      = doc.usage.total_tokens or 0,
          cost              = doc.usage.cost,  -- nil for local
          model             = doc.model or model_cfg.model,
          category          = opts.category or "main",  -- echoed tag (see Category tagging)
        }
        -- Don't emit yet: the [DONE] event marks stream end; emit
        -- once we exit the curl.post_sse loop so the caller sees
        -- usage as the LAST event in the stream order.
      end
      -- ... existing text + tool_call handling ...
    end
    -- After curl.post_sse returns (stream complete):
    if final_usage then on_delta("usage", final_usage) end

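
The `(text, usage)` contract of `broker.chat` can be sketched as a thin buffering wrapper over the streaming call. This is a minimal sketch, not the actual wrapper: `fake_chat_stream` is a hypothetical stand-in for `broker.chat_stream` that replays canned deltas without any network, and the wrapper takes the stream function as an argument only so the example runs on its own.

```lua
-- Hypothetical stand-in for broker.chat_stream: replays canned deltas and
-- emits "usage" last, mirroring the ordering described above.
local function fake_chat_stream(cfg, msgs, on_delta, opts)
  on_delta("text", "Hel")
  on_delta("text", "lo")
  on_delta("usage", {
    prompt_tokens = 15, completion_tokens = 3, total_tokens = 18,
    cost = 0.00004, category = (opts and opts.category) or "main",
  })
end

-- Buffering wrapper: concatenates text deltas and keeps the usage payload.
-- Other kinds (e.g. "tool_call") are ignored here for brevity.
local function chat(stream_fn, cfg, msgs, opts)
  local parts, usage = {}, nil
  stream_fn(cfg, msgs, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end, opts)
  return table.concat(parts), usage
end

-- A new-style caller takes both values; an old-style caller that binds only
-- the first return value keeps working unchanged.
local text, usage = chat(fake_chat_stream, {}, {}, { category = "delegate" })
local text_only = chat(fake_chat_stream, {}, {})
print(text, usage.total_tokens, usage.category, text_only)
```

One Lua detail worth checking during analyze: a call in final argument position, such as `f(broker.chat(...))`, or inside a table constructor, such as `{ broker.chat(...) }`, forwards both return values, so those call sites are not automatically unaffected by the second value.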
Outbound include_usage

    local body_table = { model = ..., messages = ..., stream = true }
    if opts.include_usage ~= false then
      body_table.stream_options = { include_usage = true }
    end

Risk: some providers reject unrecognized fields. Baseline check; if any
host throws on stream_options, the per-model opt-out is one line.
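
A per-model opt-out could sit next to the per-call switch. The sketch below assumes a hypothetical `no_stream_options` flag on the model config (not an existing config key) and a hypothetical `build_body` helper; it only illustrates how the two switches would combine.

```lua
-- Builds the outbound request body. `model_cfg.no_stream_options` is a
-- hypothetical per-model flag, not an existing config key.
local function build_body(model_cfg, msgs, opts)
  opts = opts or {}
  local body = { model = model_cfg.model, messages = msgs, stream = true }
  -- Usage is requested unless the call site or the model config opts out.
  if opts.include_usage ~= false and not model_cfg.no_stream_options then
    body.stream_options = { include_usage = true }
  end
  return body
end

local msgs = { { role = "user", content = "hi" } }
local default_body = build_body({ model = "cloud-model" }, msgs)
local call_off = build_body({ model = "cloud-model" }, msgs, { include_usage = false })
local model_off = build_body({ model = "strict-host", no_stream_options = true }, msgs)
print(default_body.stream_options ~= nil, call_off.stream_options, model_off.stream_options)
```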
Category tagging
opts.category is a string set by the caller. broker echoes it into the
emitted usage payload so the accumulator knows what to credit. Default
category if absent: "main".
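
On the receiving side, the repl handler would route the new kind to the accumulator. A minimal sketch, assuming a stub `ctx` that only records what it was credited with and a hypothetical `sink` table that collects text chunks (the real repl renders them instead):

```lua
-- Minimal stand-in for the session context; it only records each credit.
local ctx = { credited = {} }
function ctx:add_usage(model, category, u)
  self.credited[#self.credited + 1] =
    { model = model, category = category, tokens = u.total_tokens }
end

-- Builds an on_delta handler bound to a context and a text sink.
local function make_on_delta(context, sink)
  return function(kind, payload)
    if kind == "text" then
      sink[#sink + 1] = payload
    elseif kind == "usage" then
      -- The broker echoes opts.category; fall back to "main" if absent.
      context:add_usage(payload.model, payload.category or "main", payload)
    end
    -- Any other kind falls through untouched.
  end
end

local sink = {}
local on_delta = make_on_delta(ctx, sink)
on_delta("text", "ok")
on_delta("usage", { model = "cloud", category = "probe", total_tokens = 180 })
on_delta("usage", { model = "fast", total_tokens = 42 })  -- no category tag
print(#ctx.credited, ctx.credited[1].category, ctx.credited[2].category)
```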
5. Pillar 2 — Accumulator on ctx
Shape

    ctx.usage_totals = {
      -- [model_name] = { [category] = { prompt = N, completion = N,
      --                                 calls = N, cost = N } }
      fast = {
        main = { prompt = 1234, completion = 567, calls = 14, cost = 0 },
      },
      cloud = {
        main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
        delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
        probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
      },
    }
    ctx.cost_warn_fired = false

add_usage

    function Context:add_usage(model, category, u)
      model    = model or "?"
      category = category or "main"
      self.usage_totals = self.usage_totals or {}
      local m = self.usage_totals[model] or {}
      local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
      c.prompt     = c.prompt + (u.prompt_tokens or 0)
      c.completion = c.completion + (u.completion_tokens or 0)
      c.calls      = c.calls + 1
      c.cost       = c.cost + (u.cost or 0)
      m[category] = c
      self.usage_totals[model] = m
    end

    function Context:total_cost()
      local total = 0
      for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do total = total + c.cost end
      end
      return total
    end

    function Context:total_tokens()
      local p, comp = 0, 0
      for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do
          p    = p + c.prompt
          comp = comp + c.completion
        end
      end
      return p, comp
    end

Reset semantics
Context:reset() deliberately does NOT clear usage_totals —
matches R8 invariant from Phase 6 (:reset clears turns,
pending_exec_output, summary; preserves memory_items, project,
and now usage_totals). The user reset their conversation, not their
cost meter. :cost reset is the explicit reset verb for the meter.
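
The split between the two reset verbs might look like the sketch below. It is an illustration under assumptions: `new_context` and `reset_usage` are hypothetical names, and leaving `cost_warn_fired` untouched in `reset` is only one possible answer to Q-C6, not a decision.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor holding just the fields relevant to reset.
local function new_context()
  return setmetatable({
    turns = {}, pending_exec_output = nil, summary = nil,
    memory_items = {}, project = nil,
    usage_totals = {}, cost_warn_fired = false,
  }, Context)
end

-- :reset clears conversation state only.
function Context:reset()
  self.turns = {}
  self.pending_exec_output = nil
  self.summary = nil
  -- memory_items, project, usage_totals and cost_warn_fired are left alone.
end

-- :cost reset zeroes the meter and re-arms the warning.
function Context:reset_usage()
  self.usage_totals = {}
  self.cost_warn_fired = false
end

local ctx = new_context()
ctx.turns[1] = { role = "user", content = "hi" }
ctx.usage_totals.cloud = {
  main = { prompt = 10, completion = 2, calls = 1, cost = 0.001 },
}
ctx:reset()
local kept = ctx.usage_totals.cloud ~= nil      -- meter survives :reset
ctx:reset_usage()
local cleared = next(ctx.usage_totals) == nil   -- meter emptied by :cost reset
print(kept, cleared)
```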
6. Pillar 3 — :cost meta

    :cost           summary line
    :cost detail    per-model + per-category breakdown
    :cost reset     zero out ctx.usage_totals + cost_warn_fired

Summary format:

    [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
           cost=$0.0234 (cloud only; local: 0)

Detail format (sorted by total cost desc, then by model):

    [aish] session usage detail:
      cloud  main      8 calls,  3,850 /   980 tokens, $0.0180
      cloud  probe     1 call,     150 /    30 tokens, $0.0042
      cloud  delegate  1 call,     250 /    80 tokens, $0.0012
      fast   main     14 calls,  8,200 / 2,100 tokens, $0 (local)

Implementation: pure Lua iteration over ctx.usage_totals; no broker
calls. Sorting uses table.sort on a flattened list.
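
The flatten-and-sort step could be sketched as follows. The helper names (`commas`, `detail_rows`, `format_row`) are hypothetical, the category tie-break is an added assumption to make the order deterministic, and treating a zero cost as "local" is a simplification for illustration.

```lua
-- Inserts thousands separators into a non-negative integer: 12450 -> "12,450".
local function commas(n)
  local s = tostring(math.floor(n))
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))
end

-- Flattens usage_totals into rows sorted by cost (desc), then model, then
-- category.
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = { model = model, category = category, c = c }
    end
  end
  table.sort(rows, function(a, b)
    if a.c.cost ~= b.c.cost then return a.c.cost > b.c.cost end
    if a.model ~= b.model then return a.model < b.model end
    return a.category < b.category
  end)
  return rows
end

-- Renders one detail line; zero cost is annotated as local (assumption).
local function format_row(r)
  local cost = r.c.cost > 0 and ("$%.4f"):format(r.c.cost) or "$0 (local)"
  return ("  %-6s %-9s %d %s, %s / %s tokens, %s"):format(
    r.model, r.category, r.c.calls, r.c.calls == 1 and "call" or "calls",
    commas(r.c.prompt), commas(r.c.completion), cost)
end

local totals = {
  fast  = { main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
local rows = detail_rows(totals)
for _, r in ipairs(rows) do print(format_row(r)) end
```

Flattening first keeps the comparator simple and avoids relying on `pairs` order, which Lua does not guarantee.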
7. Pillar 4 — Warning thresholds
Config:

    cost = {
      warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
      warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
    }

After every ctx:add_usage, check:

    if config.cost and not ctx.cost_warn_fired then
      local cost = ctx:total_cost()
      if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
        renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
          :format(cost, config.cost.warn_at_dollars))
        ctx.cost_warn_fired = true
      end
      -- (similar for warn_at_tokens; share the flag or use two)
    end

One-shot per session. :cost reset clears the flag.
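
If the two thresholds end up with separate flags, the check could take the shape below. This is a sketch of the "use two" option only; the manifest leaves that choice open, and `check_thresholds`, the `fired` table and the token message wording are all hypothetical. It returns the messages instead of calling `renderer.status` so it can run standalone.

```lua
-- Returns the warning messages to emit after an add_usage call.
-- `fired` holds one flag per threshold, so each warns at most once.
local function check_thresholds(cost_cfg, cost, tokens, fired)
  local msgs = {}
  if not cost_cfg then return msgs end  -- no cfg.cost: Phase 6 behavior
  if cost_cfg.warn_at_dollars and not fired.dollars
     and cost >= cost_cfg.warn_at_dollars then
    fired.dollars = true
    msgs[#msgs + 1] = ("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, cost_cfg.warn_at_dollars)
  end
  if cost_cfg.warn_at_tokens and not fired.tokens
     and tokens >= cost_cfg.warn_at_tokens then
    fired.tokens = true
    msgs[#msgs + 1] = ("session tokens %d have crossed warn_at_tokens=%d")
      :format(tokens, cost_cfg.warn_at_tokens)
  end
  return msgs
end

local cfg_cost = { warn_at_dollars = 0.50, warn_at_tokens = 100000 }
local fired = {}
local first  = check_thresholds(cfg_cost, 0.30, 40000, fired)   -- below both
local second = check_thresholds(cfg_cost, 0.55, 60000, fired)   -- dollars crossed
local third  = check_thresholds(cfg_cost, 0.90, 120000, fired)  -- tokens crossed
local fourth = check_thresholds(cfg_cost, 1.20, 150000, fired)  -- both already fired
local none   = check_thresholds(nil, 9.99, 999999, {})          -- warnings disabled
print(#first, #second, #third, #fourth, #none)
```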
8. UX Surface Summary
| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses this count |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

9. Out of Scope (Phase 7)
- Cross-session cost persistence: Q-C2 defers the `<history.dir>/cost.jsonl` rollup; v1 is session-only. Per-turn usage IS in the session JSONL for after-the-fact aggregation if anyone wants to script it.
- Per-model rate limiting / cost caps that REFUSE the call: v1 only warns. A future phase could add a hard cap that aborts before the broker call.
- Pricing-table fallback for local models: if a local model doesn't emit `usage.cost`, we record 0. Estimating cost from token count + a static pricing table is a future polish (most users won't care about local "cost" anyway; local is free).
- Pretty token-bandwidth charts / sparklines: out of scope; the detail breakdown is text-only.
- Estimated cost for future turns: no preflight cost prediction.
- MCP tool-call usage: MCP servers don't expose token usage; broker calls invoked DURING MCP tool dispatch ARE captured (because they go through the same path), but the MCP tool call itself isn't.
10. Risks
| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter cost field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no cost -> displayed $0 could mislead user ("is this REALLY free?") | `:cost detail` annotates local lines with a literal `(local)`; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by #models in config × #categories, both small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |
11. Open Questions (Phase 7)
| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-C1 | What to do when a provider doesn't emit `usage` (local llama.cpp, some misconfigured proxies)? Record zero / silently skip / estimate via char/4? | Accumulator entries for those models | Analyze (probe what hossenfelder actually returns per model class) |
| Q-C2 | Should `<history.dir>/cost.jsonl` accumulate cross-session totals? `:cost --all-time` reporter? | New file + persistence layer | Defer to follow-up phase (v1 = session only) |
| Q-C3 | Categories vs free-form tags: should `opts.category` be free-form (caller decides) or validated against a closed set? | Predictability of `:cost detail` output | Analyze |
| Q-C4 | Does the hossenfelder broker forward `stream_options` to all backends? Some backends may strip it. | `include_usage` default | Baseline (real probe) |
| Q-C5 | Should `cfg.cost.warn_at_dollars` trigger on the FIRST broker call that crosses, or on the NEXT one (so user sees the call complete before the warn)? | UX detail | Analyze |
| Q-C6 | When `:reset` is called, should `cost_warn_fired` be cleared too (so the next warn-cross fires again)? Or only clear via `:cost reset`? | Reset surface area | Analyze |
12. Phase 7 → Phase 8+ Out-of-band
Candidate follow-ups (non-binding):
- Phase 8: cross-session cost persistence (Q-C2 deferral), with optional cost dashboards / weekly rollup reporter.
- Hard rate limits / cost caps that REFUSE the call: an extension of the warn surface that promotes warnings into preflight enforcement.
- Better tokenization (Q1 deferred-from-Phase-3): replace the char/4 heuristic on `Context:estimate_tokens()` with model `/tokenize` calls. Indirectly improves accuracy of any future "preflight cost predictor".
Phase 7 itself is self-contained — no upstream dependencies.