0f14dc1727
Status: Analyze -> Plan.

Q-C4 was the last open question pending baseline; now resolved per
B1 (stream_options accepted by both backends; required for local).

§13 Implementation Plan added (6 commits, bottom-up):

1. broker.lua: usage extraction from final SSE chunk; build_request
   signature widening to (model_cfg, msgs, stream, opts);
   on_delta("usage", payload); chat returns (text, usage);
   opts.category passthrough.
2. context.lua: usage_totals + cost_warn_fired fields; add_usage /
   total_cost / total_tokens helpers; :reset preserves both.
3. repl.lua: wire opts.category at 5 non-Norris call sites (main,
   delegate x2, summarize, memory_summarize); on_delta("usage")
   branch routes to ctx:add_usage.
4. safety.lua: wire opts.category for Norris main broker +
   is_destructive LLM probe; helpers.on_usage callback convention
   (no new module dep; matches #52's scrub_msgs pattern).
5. repl.lua: :cost meta surface + warn-threshold check + HELP.
6. config.lua: commented cost example block + PHASE7.md status
   bump to Implement.

Per-commit risk index covers signature-change blast radius, missed
call-site lint, and warn-flag one-shot semantics. Lua's multi-return
semantics keep broker.chat backwards-compat automatic.

Two items left open at plan, resolve at implement:
- is_destructive opts.on_usage vs cfg.helpers threading
- per-turn verbose mode (deferred; v1 = :cost on demand only)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
586 lines
29 KiB
Markdown
# aish — Phase 7 Manifest

**Project:** aish (AI-augmented conversational shell)
**Document:** Phase 7 Requirements, Architecture & Design Decisions
**Status:** Plan (formulate + analyze + baseline complete; tree at `2244a3f`)
**Date:** 2026-05-16

**Analyze findings (2026-05-16):**

A1. **broker.chat_stream surface is clean for the extension.** The
    existing `on_event(data)` closure inside `M.chat_stream` already
    parses `doc.error` / `doc.choices` / `delta` / tool_calls, so adding
    `if doc.usage then final_usage = ... end` is one block. The closure
    writes to a `final_usage` upvalue; the post-loop code in
    `chat_stream` reads it and calls `on_delta("usage", final_usage)`.
    On the request side there are two options: extend `build_request`,
    or have `chat_stream` insert `stream_options.include_usage = true`
    into the body table itself. The second would be cleaner, but the
    body is currently encoded (`json.encode`) inside `build_request`, so
    `chat_stream` never sees the table. Cleanest: extend
    `build_request(model_cfg, messages, stream, opts)` so it can read
    `opts.include_usage`. Phase 7 simplifies the signature in passing
    (see A3).

A2. **7 caller sites** identified for `opts.category` threading:

| Site | Category |
|---|---|
| `safety.lua:191` (LLM probe) | `"probe"` |
| `safety.lua:354` (norris main) | `"norris"` |
| `repl.lua:326` (summarize-on-evict) | `"summarize"` |
| `repl.lua:685` (call_broker wrapper, used by ask_ai) | `"main"` |
| `repl.lua:1104` (DELEGATE: handler) | `"delegate"` |
| `repl.lua:1587` (:memory summarize) | `"memory_summarize"` |
| `repl.lua:2156` (:delegate meta) | `"delegate"` |

All callers pass `opts` already; adding a `category` field is
additive and backward-compatible (default to `"main"` when absent).

A3. **`build_request` signature simplification.** Today it takes
    `(model_cfg, messages, stream, tools, max_tokens)`: five positional
    args. With Phase 7 needing `include_usage` AND `stream_options`,
    positional growth gets unwieldy. **Resolution:** widen to
    `(model_cfg, messages, stream, opts)` where opts carries
    `{tools, max_tokens, include_usage, stream_options}`. Callers in
    `M.chat_stream` and `M.chat` pass their existing opts table through.
    This is a refactor, but it is contained inside broker.lua.

A4. **Q-C3 RESOLVED: free-form categories.** The closed-set vs free-form
    debate resolved in favor of free-form per the helpers/skills
    convention already in place (Phase 6 :tree / :diff metas don't
    validate sub-args either). `:cost detail` will show whatever
    categories appear. In practice this is a small, documented set
    (6 distinct categories across the 7 call sites in A2), so no
    surprises.

A5. **Q-C5 RESOLVED: warn fires on the call that crossed.** The crossing
    call's usage IS in the accumulator at the moment we check (we
    check AFTER `add_usage`). Firing on the NEXT call would mean a
    delay of one full broker round-trip before the user sees the
    warning, which defeats the purpose. Just emit-on-cross.

A6. **Q-C6 RESOLVED: `:reset` does NOT clear `cost_warn_fired`.**
    Parity with `usage_totals` itself (per the §2 decision row); the
    user reset their conversation, not their cost meter. The flag
    AND the totals are reset only by the explicit `:cost reset` verb.

A7. **Norris call-graph rewires (existing safety.lua:354 path):** with
    issue #52 wired (commit `955bd82`), the Norris broker call now
    passes `helpers.scrub_msgs` / `helpers.streaming_rehydrator`. The
    on_delta wrapping pattern means the new `("usage", payload)` kind
    must also flow through every wrapper. Since the secrets
    `streaming_rehydrator` only matches on `kind == "text"`, the
    "usage" kind passes through unchanged. No new entanglement.

A8. **`ctx.usage_totals` survives `:reset` per R8**: same invariant
    as `memory_items` (Phase 4) and `project` (Phase 6). Documented in
    §5 of the manifest; reinforces the "ambient context survives
    conversation reset" rule.

A9. **Session JSONL serialization**: the assistant turn dict gets an
    optional `usage` field. `history.lua` log_turn currently calls
    `json.encode(turn)` opaquely; the dkjson serializer handles nested
    tables. No code change needed; the new field flows through
    automatically when the assistant turn carries one.

A10. **Q-C1 PARTIAL: local providers may not emit `usage`.** The
     formulate-time assumption was "treat absence as zero-cost / unknown".
     A real probe against `qwen-coder-7b-snappy-8k` is a baseline
     action (see B-probes below). The implementation will be defensive:
     if `doc.usage` never appears in the stream, no "usage" event is
     emitted, and the accumulator is unchanged for that turn. `:cost`
     output naturally reflects "0 calls counted for local model" if
     that's the case.

A11. **Q-C4 deferred to baseline**: actual `stream_options` forwarding
     by the hossenfelder proxy must be probed against a live broker.
     If the proxy strips the option, we get no `usage` events even
     for cloud calls. Baseline action (since resolved by B1; see §11).

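The pass-through property that A7 relies on can be sketched as follows. This is a minimal illustration, not the real `streaming_rehydrator`: `wrap_text_only` and `inner` are hypothetical names, and `string.upper` stands in for whatever the wrapper does to text chunks.

```lua
-- Hypothetical wrapper: transforms "text" chunks and forwards every
-- other kind (including the new "usage" kind) untouched.
local function wrap_text_only(on_delta, transform)
  return function(kind, payload)
    if kind == "text" then
      return on_delta(kind, transform(payload))
    end
    return on_delta(kind, payload)
  end
end

-- Record what the inner callback receives.
local seen = {}
local function inner(kind, payload)
  seen[#seen + 1] = { kind = kind, payload = payload }
end

local wrapped = wrap_text_only(inner, string.upper)
wrapped("text", "hi")
wrapped("usage", { prompt_tokens = 15, completion_tokens = 3 })
```

A wrapper written this way needs no change when a new kind is introduced, which is why A7 reports no new entanglement.
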
PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest
specifies what Phase 7 adds: **cost / usage observability**, the ability
to know, mid-session, how many tokens you've spent and how much money the
paid-cloud calls have cost.

PHASE0 §11 originally listed phases only through 6; this commit amends
§11 to add Phase 7.

---

## 1. Scope of Phase 7

Four pillars:

1. **Usage capture in broker**: `broker.chat_stream` extracts the
   provider's `usage` block (and `cost` where present) from the response
   stream. Surfaces it to the caller via a new `on_delta("usage", ...)`
   kind. The existing `broker.chat` buffering wrapper exposes it as a
   second return value `(text, usage)`. Backward-compatible: callers
   that don't handle the new kind / second value simply ignore it.

2. **Per-session accumulator on `ctx`**: running totals per-model AND
   per-call-category (main / delegate / summarize / memory_summarize /
   probe / norris) accumulate on `ctx.usage_totals`. No persistence
   across sessions in v1 (Q-C2 defers cross-session); the session-log
   JSONL files DO carry per-turn usage so historical analysis is
   possible after the fact.

3. **`:cost` meta**: a `:cost` reporter that shows the current session
   totals, with optional `:cost detail` for the per-model + per-category
   breakdown. Zero broker calls (purely local read of `ctx.usage_totals`).

4. **Optional warning thresholds**: `cfg.cost.warn_at_dollars` and
   `cfg.cost.warn_at_tokens` emit a status the first time the running
   total crosses the configured threshold. Default off (no warnings
   without config). Useful when cloud presets are configured and you
   want a "you've spent $1 this session" nudge before runaway cost.

**Phase 7 is done when:**

- `broker.chat_stream` exposes usage via the new `on_delta("usage", ...)`
  callback kind; `broker.chat` returns `(text, usage)`. Backward compat
  preserved (no existing caller breaks).
- After a session with mixed local + cloud calls, `:cost` prints a
  total like:
  ```
  [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
         cost=$0.0234 (cloud only; local: 0)
  ```
- `:cost detail` breaks down by model + category:
  ```
  fast  main:     14 calls, 8200/2100 tokens
  cloud main:      8 calls, 3850/980 tokens, $0.0180
  cloud delegate:  1 call,  250/80 tokens,   $0.0012
  cloud probe:     1 call,  150/30 tokens,   $0.0042
  ```
- Session JSONL gains a `usage` field on assistant turns (when the
  broker returned one).
- With `cfg.cost.warn_at_dollars = 0.50` set, crossing $0.50 cumulative
  emits exactly one status line.
- Existing configs without `cfg.cost` behave exactly like Phase 6
  (Phase 6 regression coverage).

---

## 2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, checking each SSE event for a `usage` field (present only on the final chunk) | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this (to be verified at baseline). |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, category?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default `true`) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; baseline probe confirms current broker tolerates it. (A3: `build_request` signature widens to take an `opts` table; positional growth was getting unwieldy.) |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | ctx is per-conversation; matches the `:reset`-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One-tag-per-call-site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp: no `cost` field, so record 0. | Single field name across all observed providers (per baseline B7; to be confirmed). |
| Cost precision | Store as `number` (Lua double = 53-bit mantissa, ~15 decimal digits; plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on ctx; reset only on session end / `:cost reset`. | One-shot to avoid spamming. |
| `:reset` interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with `memory_items`/`project`): the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |

---

## 3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost` and `:cost detail` / `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe-cost in the breakdown, which is useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | session.log_turn appends JSONL entries | log_turn already takes turn opaquely; assistant turns will carry `usage` if present and it'll serialize via dkjson. No code change unless filter desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | **Amendment**: add Phase 7 row to §11. |

No new module files.

---

## 4. Pillar 1 — Usage capture in broker

### SSE shape (provider-by-provider; confirm in baseline)

For OpenAI-compatible streams with `stream_options: { include_usage: true }`:

```json
data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]
```

The final usage event arrives AFTER `finish_reason` but BEFORE `[DONE]`.
`choices` is empty `[]` on the usage event.

For non-streaming `chat`: usage is in the response body at the top level.
broker.chat is a wrapper around chat_stream, so it inherits the on_delta
path.

For local llama.cpp via hossenfelder: usage may or may not be present
depending on the proxy's version. Treat absence as zero-cost / unknown.

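How the buffering wrapper might surface usage as a second return value can be sketched as below. This is a sketch under assumptions: `fake_chat_stream` is a hypothetical stand-in that replays a canned stream instead of calling a broker, and the real `broker.chat` signature may differ.

```lua
-- Stand-in for broker.chat_stream (assumption): replays a canned stream.
local function fake_chat_stream(cfg, msgs, on_delta, opts)
  on_delta("text", "Hel")
  on_delta("text", "lo")
  on_delta("usage", { prompt_tokens = 15, completion_tokens = 3, total_tokens = 18 })
end

-- Buffering wrapper: concatenates text chunks and remembers the usage payload.
local function chat(cfg, msgs, opts)
  local parts, usage = {}, nil
  fake_chat_stream(cfg, msgs, function(kind, payload)
    if kind == "text" then
      parts[#parts + 1] = payload
    elseif kind == "usage" then
      usage = payload
    end
  end, opts)
  return table.concat(parts), usage
end

local text_only = chat({}, {}, {})    -- old-style caller: second value is dropped
local text, usage = chat({}, {}, {})  -- new-style caller: captures both
```

Because Lua silently discards surplus return values, the old-style call site keeps working without a shim.
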
### Extraction algorithm

```lua
local final_usage = nil

local function on_event(data)
  -- ... existing decode of `data` into `doc` (elided) ...
  if doc.usage then
    -- Provider sent usage; capture for emission after the stream.
    final_usage = {
      prompt_tokens     = doc.usage.prompt_tokens or 0,
      completion_tokens = doc.usage.completion_tokens or 0,
      total_tokens      = doc.usage.total_tokens or 0,
      cost              = doc.usage.cost,            -- nil for local
      model             = model_cfg.model,           -- caller-stable (B4, see §13)
      category          = opts.category or "main",   -- echoed for the accumulator
    }
    -- Don't emit yet: the [DONE] event marks stream end; emit
    -- once we exit the curl.post_sse loop so the caller sees
    -- usage as the LAST event in the stream order.
  end
  -- ... existing text + tool_call handling ...
end

-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
```

### Outbound include_usage

```lua
local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
  body_table.stream_options = { include_usage = true }
end
```

Risk: some providers reject unrecognized fields. Baseline check; if any
host throws on `stream_options`, the per-model opt-out is one line.

### Category tagging

`opts.category` is a string set by the caller. broker echoes it into the
emitted usage payload so the accumulator knows what to credit. Default
category if absent: `"main"`.

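Capture, category echo, and deferred emission can be exercised end to end without a network. The following is a sketch, assuming the SSE lines have already been decoded into Lua tables; `make_usage_collector`, `on_doc`, and `finish` are hypothetical names, not functions from broker.lua.

```lua
-- Hypothetical helper: returns an on_doc(doc) handler plus a finish()
-- function to call once the stream has ended.
local function make_usage_collector(model_cfg, opts, on_delta)
  local final_usage = nil
  local function on_doc(doc)
    if doc.usage then
      final_usage = {
        prompt_tokens     = doc.usage.prompt_tokens or 0,
        completion_tokens = doc.usage.completion_tokens or 0,
        total_tokens      = doc.usage.total_tokens or 0,
        cost              = doc.usage.cost,  -- nil when the provider sends none
        model             = model_cfg.model,
        category          = (opts and opts.category) or "main",
      }
    end
  end
  local function finish()
    -- Emit at most once, after the stream, so usage is the last event.
    if final_usage then on_delta("usage", final_usage) end
  end
  return on_doc, finish
end

local events = {}
local on_doc, finish = make_usage_collector(
  { model = "cloud" }, { category = "delegate" },
  function(kind, payload) events[#events + 1] = { kind, payload } end)

on_doc({ choices = { { delta = { content = "Hi" } } } })
on_doc({ choices = {}, usage = { prompt_tokens = 15, completion_tokens = 3,
                                 total_tokens = 18, cost = 0.00004 } })
finish()
```

If no document carries `usage`, `finish()` emits nothing, which matches the defensive behaviour described in A10.
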
---

## 5. Pillar 2 — Accumulator on ctx

### Shape

```lua
ctx.usage_totals = {
  -- [model_name] = { [category] = { prompt = N, completion = N,
  --                                 calls = N, cost = N } }
  fast = {
    main = { prompt = 1234, completion = 567, calls = 14, cost = 0 },
  },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
}
ctx.cost_warn_fired = false
```

### add_usage

```lua
function Context:add_usage(model, category, u)
  model    = model or "?"
  category = category or "main"
  self.usage_totals = self.usage_totals or {}
  local m = self.usage_totals[model] or {}
  local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
  c.prompt     = c.prompt     + (u.prompt_tokens or 0)
  c.completion = c.completion + (u.completion_tokens or 0)
  c.calls      = c.calls + 1
  c.cost       = c.cost + (u.cost or 0)
  m[category] = c
  self.usage_totals[model] = m
end

function Context:total_cost()
  local total = 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do total = total + c.cost end
  end
  return total
end

function Context:total_tokens()
  local p, comp = 0, 0
  for _, m in pairs(self.usage_totals or {}) do
    for _, c in pairs(m) do
      p = p + c.prompt
      comp = comp + c.completion
    end
  end
  return p, comp
end
```

### Reset semantics

`Context:reset()` deliberately does NOT clear `usage_totals`. This
matches the R8 invariant from Phase 6 (`:reset` clears `turns`,
`pending_exec_output`, `summary`; preserves `memory_items`, `project`,
and now `usage_totals`). The user reset their conversation, not their
cost meter. `:cost reset` is the explicit reset verb for the meter.

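The two reset verbs can be contrasted in a small sketch. The field names follow the manifest, but the class body is a simplified assumption rather than the real context.lua, and `reset_usage` is a hypothetical name for whatever `:cost reset` ends up calling.

```lua
local Context = {}
Context.__index = Context

function Context.new()
  return setmetatable({
    turns = {}, summary = nil, pending_exec_output = nil,
    memory_items = {}, project = nil,
    usage_totals = {}, cost_warn_fired = false,
  }, Context)
end

-- Conversation reset: clears the conversation, keeps the ambient context.
function Context:reset()
  self.turns = {}
  self.summary = nil
  self.pending_exec_output = nil
  -- memory_items, project, usage_totals, cost_warn_fired: deliberately untouched
end

-- Hypothetical method behind `:cost reset`: zeroes the meter and re-arms the warning.
function Context:reset_usage()
  self.usage_totals = {}
  self.cost_warn_fired = false
end

local ctx = Context.new()
ctx.turns[1] = { role = "user", content = "hi" }
ctx.usage_totals.cloud = { main = { prompt = 10, completion = 2, calls = 1, cost = 0.001 } }
ctx.cost_warn_fired = true

ctx:reset()
local calls_after_reset = ctx.usage_totals.cloud.main.calls
local flag_after_reset = ctx.cost_warn_fired

ctx:reset_usage()
```

Keeping the two paths in separate methods makes the R8 invariant easy to check in the commit 2 smoke test.
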
---

## 6. Pillar 3 — `:cost` meta

```
:cost              summary line
:cost detail       per-model + per-category breakdown
:cost reset        zero out ctx.usage_totals + cost_warn_fired
```

Summary format:

```
[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
       cost=$0.0234 (cloud only; local: 0)
```

Detail format (sorted by total cost desc, then by model):

```
[aish] session usage detail:
  cloud main      8 calls,  3,850 / 980 tokens,    $0.0180
  cloud probe     1 call,   150 / 30 tokens,       $0.0042
  cloud delegate  1 call,   250 / 80 tokens,       $0.0012
  fast  main      14 calls, 8,200 / 2,100 tokens,  $0 (local)
```

Implementation: pure Lua iteration over `ctx.usage_totals`; no broker
calls. Sorting uses `table.sort` on a flattened list.

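The flatten-then-sort step can be sketched as follows. The helper names (`commas`, `detail_rows`, `detail_line`) are hypothetical, the line layout is simplified (no column alignment), and treating a zero cost as "local" is an assumption made for the example.

```lua
-- Insert thousands separators: 12450 -> "12,450".
local function commas(n)
  local s = tostring(math.floor(n))
  local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
  return (out:gsub("^,", ""))
end

-- Flatten usage_totals into rows sorted by cost (desc), then model, then category.
local function detail_rows(usage_totals)
  local rows = {}
  for model, cats in pairs(usage_totals) do
    for category, c in pairs(cats) do
      rows[#rows + 1] = { model = model, category = category, c = c }
    end
  end
  table.sort(rows, function(a, b)
    if a.c.cost ~= b.c.cost then return a.c.cost > b.c.cost end
    if a.model ~= b.model then return a.model < b.model end
    return a.category < b.category
  end)
  return rows
end

local function detail_line(r)
  -- Assumption: a zero cost is rendered as "$0 (local)".
  local cost = r.c.cost > 0 and ("$%.4f"):format(r.c.cost) or "$0 (local)"
  return ("%s %s: %d %s, %s / %s tokens, %s"):format(
    r.model, r.category, r.c.calls, r.c.calls == 1 and "call" or "calls",
    commas(r.c.prompt), commas(r.c.completion), cost)
end

local rows = detail_rows({
  fast  = { main = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
  cloud = {
    main     = { prompt = 3850, completion = 980, calls = 8, cost = 0.0180 },
    delegate = { prompt = 250,  completion = 80,  calls = 1, cost = 0.0012 },
    probe    = { prompt = 150,  completion = 30,  calls = 1, cost = 0.0042 },
  },
})
local first_line = detail_line(rows[1])
```

The extra category tiebreak keeps the comparator deterministic, since `pairs` gives no ordering guarantee for the flattened rows.
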
---

## 7. Pillar 4 — Warning thresholds

Config:

```lua
cost = {
  warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
  warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}
```

After every `ctx:add_usage`, check:

```lua
if config.cost and not ctx.cost_warn_fired then
  local cost = ctx:total_cost()
  if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
    renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, config.cost.warn_at_dollars))
    ctx.cost_warn_fired = true
  end
  -- (similar for warn_at_tokens; share the flag or use two)
end
```

One-shot per session. `:cost reset` clears the flag.

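One way to centralize the check, covering both thresholds, is sketched below. It assumes a single shared `cost_warn_fired` flag (the manifest leaves "share the flag or use two" open); `check_cost_warn` is a hypothetical helper, the token message wording is invented for the example, and the `ctx` table is a stub with fixed totals rather than a real Context.

```lua
-- Hypothetical helper: call once after every ctx:add_usage.
-- Assumes ONE shared flag for both thresholds.
local function check_cost_warn(ctx, cost_cfg, status)
  if not cost_cfg or ctx.cost_warn_fired then return false end
  local cost = ctx:total_cost()
  local p, c = ctx:total_tokens()
  if cost_cfg.warn_at_dollars and cost >= cost_cfg.warn_at_dollars then
    status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
      :format(cost, cost_cfg.warn_at_dollars))
  elseif cost_cfg.warn_at_tokens and (p + c) >= cost_cfg.warn_at_tokens then
    -- Message wording is an assumption; the manifest only specifies the dollar text.
    status(("session tokens %d have crossed warn_at_tokens=%d")
      :format(p + c, cost_cfg.warn_at_tokens))
  else
    return false
  end
  ctx.cost_warn_fired = true
  return true
end

-- Minimal stand-in for ctx with settable totals.
local ctx = {
  cost_warn_fired = false, cost = 0.30, tokens = { 1000, 200 },
  total_cost = function(self) return self.cost end,
  total_tokens = function(self) return self.tokens[1], self.tokens[2] end,
}
local messages = {}
local function status(msg) messages[#messages + 1] = msg end
local cfg_cost = { warn_at_dollars = 0.50 }

check_cost_warn(ctx, cfg_cost, status)  -- below threshold: silent
ctx.cost = 0.62
check_cost_warn(ctx, cfg_cost, status)  -- crossing call: warns
ctx.cost = 5.00
check_cost_warn(ctx, cfg_cost, status)  -- flag already set: silent
```

With a shared flag, whichever threshold is crossed first consumes the one warning; two flags would let dollars and tokens warn independently.
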
---

## 8. UX Surface Summary

| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses this count |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

---

## 9. Out of Scope (Phase 7)

- **Cross-session cost persistence**: Q-C2 defers the `<history.dir>/cost.jsonl`
  rollup; v1 is session-only. Per-turn usage IS in the session JSONL for
  after-the-fact aggregation if anyone wants to script it.
- **Per-model rate limiting / cost caps that REFUSE the call**: v1 only
  warns. A future phase could add a hard cap that aborts before the
  broker call.
- **Pricing-table fallback for local models**: if a local model doesn't
  emit `usage.cost`, we record 0. Estimating cost from token count + a
  static pricing table is a future polish (most users won't care about
  local "cost" anyway, since local is free).
- **Pretty token-bandwidth charts / sparklines**: out of scope; the
  detail breakdown is text-only.
- **Estimated cost for future turns**: no preflight cost prediction.
- **MCP tool-call usage**: MCP servers don't expose token usage;
  broker calls invoked DURING MCP tool dispatch ARE captured (because
  they go through the same path), but the MCP tool call itself isn't.

---

## 10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter `cost` field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no `cost` -> displayed `$0` could mislead user "is this REALLY free?" | `:cost detail` annotates local lines with `(local)` literal; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by `#models in config` × `#categories`; both are small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |

---

## 11. Open Questions (Phase 7)

| # | Question | Resolution |
|---|---|---|
| Q-C1 | Provider-without-usage handling | A10: defensive silent skip; baseline probe will confirm shape on local llama.cpp. |
| Q-C2 | Cross-session cost persistence (`cost.jsonl`) | Deferred to follow-up phase 8; v1 is session-only. |
| Q-C3 | Categories closed-set vs free-form | A4: **free-form**; caller decides. Matches Phase 6 helpers/skills convention. |
| Q-C4 | `stream_options` forwarding by hossenfelder | B1 RESOLVED: both backends accept; flag is REQUIRED for local llama.cpp, no-op for cloud. Default-true is correct. |
| Q-C5 | Warn fires on the crossed call or the next | A5: **on the crossed call** (no UX-defeating delay). |
| Q-C6 | `:reset` clears `cost_warn_fired` | A6: **no**, only `:cost reset` clears the flag (R8 parity). |

---

## 12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

- **Phase 8**: cross-session cost persistence (Q-C2 deferral), with
  optional cost dashboards / weekly rollup reporter.
- **Hard rate limits / cost caps that REFUSE the call**: an extension
  of the warn surface that promotes warnings into preflight enforcement.
- **Better tokenization** (Q1 deferred-from-Phase-3): replace the char/4
  heuristic on `Context:estimate_tokens()` with model `/tokenize` calls.
  Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained, with no upstream dependencies.

---

## 13. Implementation Plan (commit-by-commit)

Bottom-up; broker first (it's the egress point that all callers
depend on), then context (the accumulator), then the call-site
rewires, then the user-facing meta + warn surface, then config +
status bump. Each commit leaves the tree green (existing tests +
load smoke + per-commit feature smoke).

### Order

1. **`broker.lua`: usage capture + signature widening.**
   - `build_request(model_cfg, messages, stream, opts)` widened to
     take an opts table; `opts.tools` / `opts.max_tokens` fold in from
     the existing positional args. `opts.include_usage` (default true)
     adds `stream_options.include_usage = true` to the request body
     (per B1, required for local).
   - `M.chat_stream` event loop adds
     `if doc.usage then final_usage = doc.usage end`; after
     `curl.post_sse` returns, if `final_usage` is set,
     `on_delta("usage", payload)` is called. Payload includes
     `model = model_cfg.model` (caller-stable per B4), the raw token
     counts, and `cost` as a number (nil for local per B3).
   - `opts.category` passthrough: the broker just echoes it into the
     emitted usage payload; doesn't validate (per A4 free-form).
   - `M.chat` (the non-streaming wrapper) returns `(text, usage)`;
     backward-compatible (existing callers ignore the second value).
   - Smoke: hand-build a request with stream_options, capture all
     three on_delta kinds (text, tool_call when applicable, usage),
     confirm usage payload matches what curl shows.

2. **`context.lua`: accumulator + helpers.**
   - `Context.new`: `self.usage_totals = {}` + `self.cost_warn_fired = false`.
   - `Context:add_usage(model, category, usage)`: increments
     `usage_totals[model][category]` slots.
   - `Context:total_cost()`: sums all cost fields across all models/categories.
   - `Context:total_tokens()`: sums prompt + completion separately.
   - `Context:reset`: does NOT touch `usage_totals` or `cost_warn_fired`
     (R8 parity with `memory_items` and `project`).
   - Smoke: 4-case inline test of add_usage / totals / reset preservation.

3. **`repl.lua`: wire opts.category + on_delta("usage") at non-Norris call sites.**
   - call_broker wrapper (used by ask_ai): pass
     `opts.category = "main"`; the on_delta wrapper handles
     `kind == "usage"` by calling `ctx:add_usage(req_name, "main", payload)`.
   - DELEGATE: handler: opts.category = "delegate".
   - :delegate meta: opts.category = "delegate".
   - summarize-on-evict callback: opts.category = "summarize".
   - :memory summarize: opts.category = "memory_summarize".
   - For broker.chat callers (non-streaming): capture the new second
     return value and feed to ctx:add_usage.
   - Smoke: send one cloud prompt, observe ctx.usage_totals grows.

4. **`safety.lua`: opts.category for Norris + probe.**
   - safety.norris_step's broker.chat_stream call: pass
     opts.category = "norris"; the helpers.on_usage callback (added to
     the helpers table by repl.lua) routes back to ctx:add_usage. OR,
     simpler, safety.lua wraps on_delta itself with a "usage"-kind
     branch that calls helpers.on_usage.
   - safety.is_destructive's llm_probe broker.chat call: pass
     opts.category = "probe"; capture the (text, usage) return and
     forward via opts.on_usage callback (added to is_destructive opts).
   - Smoke: a Norris session shows both "norris" and "probe" category
     entries in :cost detail.

5. **`repl.lua`: :cost meta + warn-threshold + HELP.**
   - :cost (summary), :cost detail (per-model+category breakdown),
     :cost reset (zero totals + clear cost_warn_fired).
   - After every ctx:add_usage call (centralized in a helper if
     possible), check cfg.cost.warn_at_dollars / warn_at_tokens;
     emit one-shot status if crossed AND cost_warn_fired is false.
   - HELP gains 3 lines for :cost.
   - Smoke: :cost shows totals; :cost detail breaks down; warn fires
     once when threshold crossed; :cost reset re-arms.

6. **`config.lua` example block + `docs/PHASE7.md` status bump.**
   - Commented-out
     `cost = { warn_at_dollars = 0.50, warn_at_tokens = 100000 }`
     block in config.lua.
   - PHASE7.md status header → **Implement** (matches Phase 5/6
     cadence: the manifest tracks implementation state).

### Risk index per commit

| Commit | Risk | Mitigation |
|---|---|---|
| 1 (broker) | build_request signature change breaks all existing callers | Every caller of chat_stream/chat already passes an opts table, so tools/max_tokens move INTO opts directly; no temporary positional fallback (`opts.tools = old_tools` if positional was used) is needed |
| 1 (broker) | `M.chat` second return value confuses callers that do `local r = broker.chat(...)` discarding the second | Lua doesn't error on dropped return values; backward-compat preserved automatically |
| 2 (context) | usage_totals nil on old ctx serializations | Defensive `self.usage_totals = self.usage_totals or {}` in add_usage; no migration needed |
| 3 (repl wires) | Forgetting one call site = silent under-count | Lint by grep for `broker.chat\(` and `broker.chat_stream\(` after the wire commit; ensure each is tagged |
| 4 (safety wires) | safety.lua must NOT introduce a new module dependency (no `require("secrets")`-style import) | Use helpers.on_usage callback convention (same shape as #52's scrub_msgs); no module dep |
| 5 (:cost + warn) | warn fires multiple times when one call overshoots the threshold by a wide margin | cost_warn_fired one-shot flag; explicit :cost reset to re-arm |
| 6 (config + status) | none | |

### Tests + smoke per commit

Each commit must:
- Pass `luajit test_safety.lua` (87/87) and `luajit test_router_model.lua` (31/31)
- Load cleanly via `luajit -e 'package.path=...; require("repl"); print("ok")'`
- Pass a per-feature smoke (described under each commit in the Order list above)

### Things deliberately NOT split

- broker.chat backward-compat shim: Lua's multiple-return-values
  semantics handle it automatically (existing `local r = broker.chat(..)`
  drops the new `usage` value).
- Per-category sub-tables: flat `model -> category -> counters` is
  simple enough; nesting deeper for e.g. timestamps is v2.
- Cross-session persistence: explicitly Q-C2, deferred to phase 8.

### Open at plan-time (resolve at implement)

- Whether `safety.is_destructive`'s opts should carry an `on_usage`
  callback explicitly OR thread it through cfg.helpers (the latter
  matches the Norris helpers convention but adds coupling).
  Decide at commit 4. Default to explicit opts.on_usage for minimum
  surface.
- Whether to emit a `[aish] usage: model=X prompt=N completion=M cost=$X`
  status line PER TURN (verbose mode) or only via :cost on demand.
  v1 = on demand only; verbose mode is a follow-up nice-to-have.
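
The "explicit opts.on_usage" default for the probe could look like the sketch below. Everything here is hypothetical: `fake_broker_chat` stands in for `broker.chat`, the `"SAFE"` verdict text is invented, and the real `safety.is_destructive` signature is not shown in this manifest.

```lua
-- Stand-in for broker.chat returning (text, usage); verdict text is made up.
local function fake_broker_chat(cfg, msgs, opts)
  return "SAFE", { prompt_tokens = 150, completion_tokens = 30,
                   total_tokens = 180, cost = 0.0042 }
end

-- Sketch of the probe: forwards usage through opts.on_usage when the
-- caller supplies one, and stays silent otherwise.
local function llm_probe(cmd, opts)
  opts = opts or {}
  local text, usage = fake_broker_chat(
    {}, { { role = "user", content = cmd } }, { category = "probe" })
  if usage and opts.on_usage then
    opts.on_usage("probe", usage)
  end
  return text ~= "SAFE"
end

local credited = {}
local destructive = llm_probe("ls -la", {
  on_usage = function(category, usage)
    credited[#credited + 1] = { category, usage }
  end,
})
local also_fine = llm_probe("ls -la")  -- no callback: usage is simply not recorded
```

The callback keeps safety.lua free of any reference to `ctx`; the caller in repl.lua would decide where the usage is credited.
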