docs/PHASE7: formulate — cost / usage observability

Phase 7 formulate manifest + PHASE0 §11 amendment to add the Phase 7
row (substrate amendment per CLAUDE.md §3, lands in the same commit).

Four pillars:

1. Usage capture in broker.chat_stream — extract `usage` from the
   final SSE chunk (OpenAI streaming spec with `stream_options:
   {include_usage: true}`). Surface via new on_delta("usage",
   payload) kind. broker.chat returns (text, usage);
   backward-compat: existing callers ignore the second value.
2. Per-session accumulator on ctx — ctx.usage_totals[model][category]
   tables (categories: main / delegate / summarize / memory_summarize
   / probe / norris, tagged at the call site via opts.category).
   :reset preserves usage_totals (R8 parity with memory_items /
   project). Session JSONL gains an optional `usage` field on
   assistant turns for after-the-fact analysis.
3. :cost meta surface — :cost (summary), :cost detail (per-model +
   per-category breakdown), :cost reset (zero the meter). Pure-Lua
   read of ctx.usage_totals; no broker calls.
4. Optional warn thresholds — cfg.cost.warn_at_dollars /
   warn_at_tokens emit a one-shot status when crossed. Default off;
   useful with cloud presets configured.

Doc covers scope + done-when criteria, tech decisions table, module
changes, per-pillar deep dive with code sketches, UX surface, out of
scope, risks, 6 open questions to resolve in analyze.

Open at formulate:

  Q-C1 — provider-without-usage handling (local llama.cpp probably)
  Q-C2 — cross-session persistence (defer to phase 8)
  Q-C3 — categories closed-set vs free-form
  Q-C4 — does hossenfelder forward stream_options to all backends?
  Q-C5 — warn fires on the call that crosses, or the next one?
  Q-C6 — :reset clears cost_warn_fired too, or only :cost reset?

Scope confirmed via AskUserQuestion: cost/usage observability
(chosen over project-local config overlay and session search/tag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -316,6 +316,7 @@ from somewhere else.
| **4** | `memory.jsonl` summarization, startup context injection from memory, `:history` management, pruning |
| **5** | Multi-model routing by task type, cloud fallback, context summarization via fast model on eviction |
| **6** | Tree-sitter syntax highlighting hooks, diff-aware code injection, project-level context (file tree summary) |
| **7** | Cost / usage observability: broker captures `usage` + `cost`; per-session accumulator on ctx; `:cost` reporter; optional warn thresholds |

---

@@ -0,0 +1,375 @@
# aish — Phase 7 Manifest

**Project:** aish — AI-augmented conversational shell
**Document:** Phase 7 Requirements, Architecture & Design Decisions
**Status:** Formulate (pre-analyze)
**Date:** 2026-05-16

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest
specifies what Phase 7 adds — **cost / usage observability**: the ability
to know, mid-session, how many tokens you've spent and how much money the
paid-cloud calls have cost.

PHASE0 §11 originally listed phases only through 6; this commit amends
§11 to add Phase 7.

---

## 1. Scope of Phase 7

Four pillars:

1. **Usage capture in broker** — `broker.chat_stream` extracts the
   provider's `usage` block (and `cost` where present) from the response
   stream. Surfaces it to the caller via a new `on_delta("usage", ...)`
   kind. The existing `broker.chat` buffering wrapper exposes it as a
   second return value `(text, usage)`. Backward-compatible: callers
   that don't handle the new kind / second value simply ignore it.

2. **Per-session accumulator on `ctx`** — running totals per-model AND
   per-call-category (main / delegate / summarize / memory_summarize /
   probe / norris) accumulate on `ctx.usage_totals`. No persistence
   across sessions in v1 (Q-C2 defers cross-session); the session-log
   JSONL files DO carry per-turn usage so historical analysis is
   possible after the fact.

3. **`:cost` meta** — a `:cost` reporter that shows the current session
   totals, with optional `:cost detail` for the per-model + per-category
   breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).

4. **Optional warning thresholds** — `cfg.cost.warn_at_dollars` and
   `cfg.cost.warn_at_tokens` emit a status the first time the running
   total crosses the configured threshold. Default off (no warnings
   without config). Useful when cloud presets are configured and you
   want a "you've spent $1 this session" nudge before runaway cost.

**Phase 7 is done when:**

- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)`
  callback kind; `broker.chat` returns `(text, usage)`. Backward compat
  preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a
  total like:
  ```
  [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
         cost=$0.0234 (cloud only; local: 0)
  ```
- `:cost detail` breaks down by model + category:
  ```
  fast main: 14 calls, 8200/2100 tokens
  cloud main: 8 calls, 3850/980 tokens, $0.0180
  cloud delegate: 1 call, 250/80 tokens, $0.0012
  cloud probe: 1 call, 150/30 tokens, $0.0042
  ```
- Session JSONL gains a `usage` field on assistant turns (when the
  broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative
  emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6
  (Phase 6 regression coverage).

---

## 2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, reading the `usage` field of the final SSE chunk | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified in baseline. |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, category?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; the baseline probe must confirm that the current broker tolerates it (Q-C4). |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One tag per call site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field — record 0. | Single field name across all observed providers (per baseline B7 — to be confirmed). |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits — plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on ctx; reset only on session end / `:cost reset` (whether `:reset` also clears it is open, see Q-C6). | One-shot to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`) — the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when the broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |

---

## 3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost`, `:cost detail`, and `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe cost in the breakdown — useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes the turn opaquely; assistant turns carry `usage` if present and it serializes via dkjson. No code change unless filtering is desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | **Amendment**: add Phase 7 row to §11. |

No new module files.

---

## 4. Pillar 1 — Usage capture in broker

### SSE shape (provider-by-provider — confirm in baseline)

For OpenAI-compatible streams with `stream_options: { include_usage: true }`:

```text
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```

The final usage event arrives AFTER `finish_reason` but BEFORE `[DONE]`.
`choices` is empty `[]` on the usage event.

In a non-streaming OpenAI-style response, `usage` sits at the top level of
the response body. `broker.chat` does not need that path: it is a buffering
wrapper around `chat_stream`, so it receives usage through the same
`on_delta` callback.

For local llama.cpp via hossenfelder: usage may or may not be present
depending on the proxy's version. Treat absence as zero-cost / unknown
(the exact handling is Q-C1).

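
The buffering behaviour described above can be sketched as follows. This is a minimal illustration, not the actual `broker.chat`: the `chat_stream` function is passed in as a parameter so the example runs without a network call, and `fake_chat_stream` is a hypothetical stand-in that replays a canned stream.

```lua
-- Hypothetical stand-in for broker.chat_stream: replays a canned stream so
-- the wrapper below can run on its own.
local function fake_chat_stream(cfg, msgs, on_delta, opts)
  on_delta("text", "Hel")
  on_delta("text", "lo")
  on_delta("usage", { prompt_tokens = 15, completion_tokens = 3,
                      total_tokens = 18, cost = 0.00004 })
end

-- Buffering wrapper: concatenates text deltas, keeps the usage payload,
-- and silently ignores any kind it does not recognise.
local function chat(chat_stream, cfg, msgs, opts)
  local parts, usage = {}, nil
  chat_stream(cfg, msgs, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end, opts)
  return table.concat(parts), usage
end

-- An existing caller binds one value; the extra return is dropped.
local text_only = chat(fake_chat_stream, {}, {}, nil)

-- A new caller opts in by binding the second value.
local text, usage = chat(fake_chat_stream, {}, {}, nil)
print(text_only, text, usage.total_tokens)
```

Because Lua discards surplus return values, adding `usage` as a second result does not disturb call sites that assign only the text.
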
### Extraction algorithm

```lua
local final_usage = nil

local function on_event(data)
  -- ... existing parsing of `data` into `doc` (elided) ...
  -- Check for a table: non-final chunks may carry a null `usage`, and some
  -- JSON decoders map null to a truthy sentinel rather than nil.
  if type(doc.usage) == "table" then
    -- Provider sent usage; capture for emission after the stream.
    final_usage = {
      prompt_tokens     = doc.usage.prompt_tokens or 0,
      completion_tokens = doc.usage.completion_tokens or 0,
      total_tokens      = doc.usage.total_tokens or 0,
      cost              = doc.usage.cost,  -- nil for local
      model             = doc.model or model_cfg.model,
      -- Echo the caller's tag so the accumulator knows what to credit.
      category          = (opts and opts.category) or "main",
    }
    -- Don't emit yet — the [DONE] event marks stream end; emit
    -- once we exit the curl.post_sse loop so the caller sees
    -- usage as the LAST event in the stream order.
  end
  -- ... existing text + tool_call handling ...
end

-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
```

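
To make the ordering guarantee concrete, here is a self-contained sketch that replays the SSE shape shown earlier. It assumes the chunks are already decoded into Lua tables (a real implementation would run a JSON decoder on each `data:` line), and `replay_stream` is a hypothetical name used only for this illustration.

```lua
-- Assumption: each element of `events` is an already-decoded SSE chunk, or
-- the string "[DONE]" for the stream terminator.
local function replay_stream(events, on_delta, opts)
  local final_usage = nil
  for _, doc in ipairs(events) do
    if doc == "[DONE]" then break end
    local choice = doc.choices and doc.choices[1]
    if choice and choice.delta and choice.delta.content then
      on_delta("text", choice.delta.content)
    end
    if type(doc.usage) == "table" then
      final_usage = {
        prompt_tokens     = doc.usage.prompt_tokens or 0,
        completion_tokens = doc.usage.completion_tokens or 0,
        total_tokens      = doc.usage.total_tokens or 0,
        cost              = doc.usage.cost,
        category          = (opts and opts.category) or "main",
      }
    end
  end
  -- Emitted after the loop, so usage is always the last event seen.
  if final_usage then on_delta("usage", final_usage) end
end

local seen = {}
local function record(kind, payload)
  seen[#seen + 1] = { kind = kind, payload = payload }
end

replay_stream({
  { choices = { { index = 0, delta = { content = "Hi" } } } },
  { choices = { { index = 0, delta = {}, finish_reason = "stop" } } },
  { choices = {}, usage = { prompt_tokens = 15, completion_tokens = 3,
                            total_tokens = 18, cost = 0.00004 } },
  "[DONE]",
}, record, { category = "probe" })

-- A stream with no usage chunk (for example a local backend) emits no
-- "usage" event at all.
local seen_local = {}
replay_stream({
  { choices = { { index = 0, delta = { content = "ok" } } } },
  "[DONE]",
}, function(kind) seen_local[#seen_local + 1] = kind end, nil)
```
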
### Outbound include_usage

```lua
local body_table = { model = ..., messages = ..., stream = true }
-- Opt-out only: a missing `opts` or a missing field leaves usage enabled.
if not opts or opts.include_usage ~= false then
  body_table.stream_options = { include_usage = true }
end
```

Risk: some providers reject unrecognized fields. This gets a baseline
check; if any host rejects `stream_options`, a per-model opt-out is a
one-line change.

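
One way the per-model opt-out could look is sketched below. The `include_usage` field on the model entry and the `build_body` helper are assumptions made for this illustration; the manifest only specifies the per-call `opts.include_usage` switch.

```lua
-- Hypothetical helper that builds the outbound request body.
-- Assumption: a model entry may carry `include_usage = false` as a
-- per-model opt-out (illustrative field name, not from the manifest).
local function build_body(model_cfg, msgs, opts)
  opts = opts or {}
  local body = { model = model_cfg.model, messages = msgs, stream = true }
  local enabled = opts.include_usage ~= false
                  and model_cfg.include_usage ~= false
  if enabled then
    body.stream_options = { include_usage = true }
  end
  return body
end

local msgs = { { role = "user", content = "hi" } }

-- Default: usage reporting is requested.
local default_body = build_body({ model = "cloud" }, msgs, nil)

-- Per-model opt-out for a host that rejects stream_options.
local strict_body = build_body({ model = "fast", include_usage = false },
                               msgs, nil)

-- Per-call opt-out, as described in the decisions table.
local call_body = build_body({ model = "cloud" }, msgs,
                             { include_usage = false })
```

Either switch alone is enough to drop `stream_options` from the request, so a misbehaving host can be handled in config without touching call sites.
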
### Category tagging

`opts.category` is a string set by the caller. The broker echoes it into
the emitted usage payload so the accumulator knows which bucket to credit.
If the caller omits it, the category defaults to `"main"`.

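
On the receiving side, the tag can be consumed by a small callback adapter. The sketch below is an assumption about how a call site might be wired: `with_usage` is a hypothetical helper, and the `ctx` table is a stripped-down stand-in for the real accumulator that only counts total tokens per `model/category` key.

```lua
-- Minimal accumulator stand-in so the handler can run on its own.
local ctx = { credited = {} }
function ctx:add_usage(model, category, u)
  local key = model .. "/" .. category
  self.credited[key] = (self.credited[key] or 0) + (u.total_tokens or 0)
end

-- Builds an on_delta callback that routes "usage" events into the
-- accumulator and passes every other kind to the caller's own handler.
local function with_usage(acc, inner)
  return function(kind, payload)
    if kind == "usage" then
      acc:add_usage(payload.model or "?", payload.category or "main", payload)
    elseif inner then
      inner(kind, payload)
    end
  end
end

local texts = {}
local on_delta = with_usage(ctx, function(kind, payload)
  if kind == "text" then texts[#texts + 1] = payload end
end)

on_delta("text", "ok")
on_delta("usage", { model = "cloud", category = "delegate", total_tokens = 330 })
on_delta("usage", { model = "fast", total_tokens = 50 })  -- no tag: "main"
```
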

---

## 5. Pillar 2 — Accumulator on ctx

### Shape

```lua
ctx.usage_totals = {
  -- [model_name] = { [category] = { prompt = N, completion = N,
  --                                 calls = N, cost = N } }
  fast = {
    main     = { prompt = 1234, completion = 567, calls = 14, cost = 0 },
  },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
ctx.cost_warn_fired = false
```

### add_usage

```lua
-- Credit one broker call to the (model, category) bucket.
function Context:add_usage(model, category, u)
  model = model or "?"
  category = category or "main"
  u = u or {}  -- tolerate a missing payload; the call is still counted
  self.usage_totals = self.usage_totals or {}
  local m = self.usage_totals[model] or {}
  local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
  c.prompt     = c.prompt + (u.prompt_tokens or 0)
  c.completion = c.completion + (u.completion_tokens or 0)
  c.calls      = c.calls + 1
  c.cost       = c.cost + (u.cost or 0)  -- local models report no cost
  m[category] = c
  self.usage_totals[model] = m
end

-- Sum of cost across every model and category.
function Context:total_cost()
  local total = 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do total = total + c.cost end
  end
  return total
end

-- Returns two values: total prompt tokens, total completion tokens.
function Context:total_tokens()
  local p, comp = 0, 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do
      p = p + c.prompt
      comp = comp + c.completion
    end
  end
  return p, comp
end
```

### Reset semantics

`Context:reset()` deliberately does NOT clear `usage_totals`. This
matches the R8 invariant from Phase 6 (`:reset` clears `turns`,
`pending_exec_output`, `summary`; preserves `memory_items`, `project`,
and now `usage_totals`). The user reset their conversation, not their
cost meter. `:cost reset` is the explicit reset verb for the meter.

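
The two reset verbs can be contrasted in a small sketch. The field list follows the paragraph above; the constructor and the method name `reset_usage` are assumptions made for this illustration, since the manifest does not name the function behind `:cost reset`.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor with only the fields named in this section.
function Context.new()
  return setmetatable({
    turns = {}, summary = nil, pending_exec_output = nil,
    memory_items = {}, project = nil,
    usage_totals = {}, cost_warn_fired = false,
  }, Context)
end

-- `:reset` clears conversation state and leaves the meter alone.
function Context:reset()
  self.turns = {}
  self.pending_exec_output = nil
  self.summary = nil
end

-- `:cost reset` zeroes the meter and re-arms the warning.
-- (Method name is an assumption.)
function Context:reset_usage()
  self.usage_totals = {}
  self.cost_warn_fired = false
end

local ctx = Context.new()
ctx.turns = { "hello" }
ctx.usage_totals = { cloud = { main = { prompt = 10, completion = 2,
                                        calls = 1, cost = 0.018 } } }
ctx.cost_warn_fired = true

ctx:reset()
local survived = ctx.usage_totals.cloud ~= nil   -- meter survives :reset
local turns_after_reset = #ctx.turns

ctx:reset_usage()
local meter_empty = next(ctx.usage_totals) == nil
```
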

---

## 6. Pillar 3 — `:cost` meta

```
:cost            summary line
:cost detail     per-model + per-category breakdown
:cost reset      zero out ctx.usage_totals + cost_warn_fired
```

Summary format:

```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
       cost=$0.0234 (cloud only; local: 0)
```

Detail format (sorted by total cost desc, then by model):

```
[aish] session usage detail:
  cloud  main       8 calls, 3,850 / 980 tokens, $0.0180
  cloud  probe      1 call, 150 / 30 tokens, $0.0042
  cloud  delegate   1 call, 250 / 80 tokens, $0.0012
  fast   main      14 calls, 8,200 / 2,100 tokens, $0 (local)
```

Implementation: pure Lua iteration over `ctx.usage_totals`; no broker
calls. Sorting uses `table.sort` on a flattened list.

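
The flatten-and-sort step might look like the sketch below. The helper names (`commas`, `detail_rows`, `format_row`) are hypothetical, the category tie-break is an addition to keep the order deterministic, and treating a zero cost as "local" is a simplification made here; the manifest does not say how local rows are identified.

```lua
-- Inserts thousands separators: 12450 -> "12,450".
local function commas(n)
  local s = tostring(math.floor(n))
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))
end

-- Flattens usage_totals into rows and sorts by cost descending, then by
-- model name, then by category.
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = { model = model, category = category, c = c }
    end
  end
  table.sort(rows, function(a, b)
    if a.c.cost ~= b.c.cost then return a.c.cost > b.c.cost end
    if a.model ~= b.model then return a.model < b.model end
    return a.category < b.category
  end)
  return rows
end

-- Assumption: a row with zero cost is labelled "(local)".
local function format_row(r)
  local c = r.c
  local unit = (c.calls == 1) and "call" or "calls"
  local cost = (c.cost > 0) and ("$%.4f"):format(c.cost) or "$0 (local)"
  return ("%-6s %-9s %d %s, %s / %s tokens, %s"):format(
    r.model, r.category, c.calls, unit,
    commas(c.prompt), commas(c.completion), cost)
end

local totals = {
  fast  = { main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}

local rows = detail_rows(totals)
for _, r in ipairs(rows) do print(format_row(r)) end
```
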

---

## 7. Pillar 4 — Warning thresholds

Config:

```lua
cost = {
  warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses this
  warn_at_tokens  = 100000,  -- emit once when cumulative tokens cross this
}
```

After every `ctx:add_usage`, check:

```lua
if config.cost and not ctx.cost_warn_fired then
  local cost = ctx:total_cost()
  if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
    renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, config.cost.warn_at_dollars))
    ctx.cost_warn_fired = true
  end
  -- (similar for warn_at_tokens; share the flag or use two)
end
```

One-shot per session. `:cost reset` clears the flag.

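
The "share the flag or use two" choice is left open above. As one possible reading, the sketch below uses two independent one-shot flags so that crossing the dollar threshold does not suppress a later token warning. The `warn_fired` table, the `check_warn` helper, the token message wording, and the injected `emit` callback are all assumptions for this illustration.

```lua
-- Assumption: two independent one-shot flags (ctx.warn_fired.dollars and
-- ctx.warn_fired.tokens) instead of the single cost_warn_fired.
local function check_warn(cost_cfg, ctx, cost, tokens, emit)
  if not cost_cfg then return end  -- no cfg.cost block: warnings are off
  ctx.warn_fired = ctx.warn_fired or {}
  local checks = {
    { key = "dollars", limit = cost_cfg.warn_at_dollars, value = cost,
      fmt = "session cost $%.4f has crossed warn_at_dollars=$%.4f" },
    { key = "tokens", limit = cost_cfg.warn_at_tokens, value = tokens,
      fmt = "session tokens %d have crossed warn_at_tokens=%d" },
  }
  for _, c in ipairs(checks) do
    if c.limit and not ctx.warn_fired[c.key] and c.value >= c.limit then
      emit(c.fmt:format(c.value, c.limit))
      ctx.warn_fired[c.key] = true
    end
  end
end

local cfg_cost = { warn_at_dollars = 0.50, warn_at_tokens = 1000 }
local ctx, messages = {}, {}
local function emit(msg) messages[#messages + 1] = msg end

check_warn(cfg_cost, ctx, 0.40, 500, emit)    -- below both thresholds
local after_first = #messages
check_warn(cfg_cost, ctx, 0.55, 800, emit)    -- dollars crossed
local after_second = #messages
check_warn(cfg_cost, ctx, 0.90, 1200, emit)   -- tokens crossed; dollars silent
local after_third = #messages
check_warn(cfg_cost, ctx, 2.00, 5000, emit)   -- both already fired
local after_fourth = #messages
check_warn(nil, {}, 10, 10, emit)             -- no config: nothing happens
```

With a single shared flag, the third call above would stay silent; which behaviour is preferable is a design decision for the analyze step.
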

---

## 8. UX Surface Summary

| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first cross this count |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

---

## 9. Out of Scope (Phase 7)

- **Cross-session cost persistence** — Q-C2 defers `<history.dir>/cost.jsonl`
  rollup; v1 is session-only. Per-turn usage IS in the session JSONL for
  after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call** — v1 only
  warns. A future phase could add a hard cap that aborts before the
  broker call.
- **Pricing-table fallback for local models** — if a local model doesn't
  emit `usage.cost`, we record 0. Estimating cost from token count + a
  static pricing table is a future polish (most users won't care about
  local "cost" anyway — local is free).
- **Pretty token-bandwidth charts / sparklines** — out of scope; the
  detail breakdown is text-only.
- **Estimated cost for future turns** — no preflight cost prediction.
- **MCP tool-call usage** — MCP servers don't expose token usage;
  broker calls invoked DURING MCP tool dispatch ARE captured (because
  they go through the same path), but the MCP tool call itself isn't.

---

## 10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> a displayed `$0` could mislead the user ("is this REALLY free?") | `:cost detail` annotates local lines with a literal `(local)`; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` could grow without bound if new model names appear mid-session | Bounded by `#models in config` × `#categories` — small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |

---

## 11. Open Questions (Phase 7)

| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-C1 | What to do when a provider doesn't emit `usage` (local llama.cpp, some misconfigured proxies)? Record zero / silently skip / estimate via char/4? | Accumulator entries for those models | Analyze (probe what hossenfelder actually returns per model class) |
| Q-C2 | Should `<history.dir>/cost.jsonl` accumulate cross-session totals? `:cost --all-time` reporter? | New file + persistence layer | Defer to follow-up phase (v1 = session only) |
| Q-C3 | Categories vs free-form tags — should `opts.category` be free-form (caller decides) or validated against a closed set? | Predictability of `:cost detail` output | Analyze |
| Q-C4 | Does the hossenfelder broker forward `stream_options` to all backends? Some backends may strip it. | include_usage default | Baseline (real probe) |
| Q-C5 | Should `cfg.cost.warn_at_dollars` trigger on the FIRST broker call that crosses, or on the NEXT one (so user sees the call complete before the warn)? | UX detail | Analyze |
| Q-C6 | When `:reset` is called, should `cost_warn_fired` be cleared too (so the next warn-cross fires again)? Or only clear via `:cost reset`? | Reset surface area | Analyze |

---

## 12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with
  optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call** — an extension
  of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4
  heuristic on `Context:estimate_tokens()` with model `/tokenize` calls.
  Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained — no upstream dependencies.