aish/docs/PHASE7.md
marfrit 3bad07b2da docs/PHASE7: formulate — cost / usage observability
Phase 7 formulate manifest + PHASE0 §11 amendment to add the Phase 7
row (substrate amendment per CLAUDE.md §3, lands in the same commit).

Four pillars:

  1. Usage capture in broker.chat_stream — extract `usage` from the
     final SSE chunk (OpenAI streaming spec with `stream_options:
     {include_usage: true}`). Surface via new on_delta("usage",
     payload) kind. broker.chat returns (text, usage) — backward-
     compat: existing callers ignore the second value.

  2. Per-session accumulator on ctx — ctx.usage_totals[model][category]
     tables (categories: main / delegate / summarize / memory_summarize
     / probe / norris, tagged at the call site via opts.category).
     :reset preserves usage_totals (R8 parity with memory_items /
     project). Session JSONL gains an optional `usage` field on
     assistant turns for after-the-fact analysis.

  3. :cost meta surface — :cost (summary), :cost detail (per-model +
     per-category breakdown), :cost reset (zero the meter). Pure-Lua
     read of ctx.usage_totals; no broker calls.

  4. Optional warn thresholds — cfg.cost.warn_at_dollars /
     warn_at_tokens emit a one-shot status when crossed. Default off;
     useful with cloud presets configured.

Doc covers scope + done-when criteria, tech decisions table, module
changes, per-pillar deep dive with code sketches, UX surface, out of
scope, risks, 6 open questions to resolve in analyze.

Open at formulate:
  Q-C1 — provider-without-usage handling (local llama.cpp probably)
  Q-C2 — cross-session persistence (defer to phase 8)
  Q-C3 — categories closed-set vs free-form
  Q-C4 — does hossenfelder forward stream_options to all backends?
  Q-C5 — warn fires on the call that crosses, or the next one?
  Q-C6 — :reset clears cost_warn_fired too, or only :cost reset?

Scope confirmed via AskUserQuestion: cost/usage observability
(chosen over project-local config overlay and session search/tag).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:47:58 +00:00


aish — Phase 7 Manifest

Project: aish (AI-augmented conversational shell)
Document: Phase 7 Requirements, Architecture & Design Decisions
Status: Formulate (pre-analyze)
Date: 2026-05-16

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest specifies what Phase 7 adds — cost / usage observability: the ability to know, mid-session, how many tokens you've spent and how much money the paid-cloud calls have cost.

PHASE0 §11 originally listed phases only through 6; this commit amends §11 to add Phase 7.


1. Scope of Phase 7

Four pillars:

  1. Usage capture in broker: broker.chat_stream extracts the provider's usage block (and cost where present) from the response stream and surfaces it to the caller via a new on_delta("usage", ...) kind. The existing broker.chat buffering wrapper exposes it as a second return value (text, usage). Backward-compatible: callers that don't handle the new kind or the second value simply ignore it.

  2. Per-session accumulator on ctx: running totals per model AND per call category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on ctx.usage_totals. No persistence across sessions in v1 (Q-C2 defers cross-session persistence); the session-log JSONL files DO carry per-turn usage, so historical analysis is possible after the fact.

  3. :cost meta: a :cost reporter that shows the current session totals, with optional :cost detail for the per-model + per-category breakdown. Zero broker calls (purely local read of ctx.usage_totals).

  4. Optional warning thresholds: cfg.cost.warn_at_dollars and cfg.cost.warn_at_tokens emit a status the first time the running total crosses the configured threshold. Default off (no warnings without config). Useful when cloud presets are configured and you want a "you've spent $1 this session" nudge before runaway cost.

Phase 7 is done when:

  • broker.chat_stream exposes usage via the new on_delta("usage", ...) callback kind; broker.chat returns (text, usage). Backward compat preserved (no existing caller breaks).
  • After a session with mixed local + cloud calls, :cost prints a total like:
    [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                                    cost=$0.0234 (cloud only; local: 0)
    
  • :cost detail breaks down by model + category:
    fast    main: 14 calls, 8200/2100 tokens
    cloud   main: 8 calls, 3850/980 tokens, $0.0180
    cloud   delegate: 1 call, 250/80 tokens, $0.0012
    cloud   probe: 1 call, 150/30 tokens, $0.0042
    
  • Session JSONL gains a usage field on assistant turns (when the broker returned one).
  • With cfg.cost.warn_at_dollars = 0.50 set, crossing $0.50 cumulative emits exactly one status line.
  • Existing configs without cfg.cost behave exactly like Phase 6 (Phase 6 regression coverage).

2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the broker.chat_stream event loop, checking each SSE event for a usage field (expected on the final chunk) | The OpenAI streaming spec puts usage on the FINAL chunk when stream_options: { include_usage: true } is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; need to verify (baseline). |
| New on_delta kind | on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? }) | Mirrors the existing ("text", chunk) / ("tool_call", call) shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | opts.include_usage = true (default true) sets stream_options.include_usage = true in the outbound request body | Off-switch for hosts that reject stream_options. Defaults on; the baseline probe is to confirm that the current broker tolerates it (Q-C4). |
| Accumulator location | ctx.usage_totals[model_name][category] table | ctx is per-conversation; matches the :reset-survives-or-not rules already in place. |
| Categories | "main" (ask_ai), "delegate", "summarize", "memory_summarize", "probe", "norris" | One tag per call site. Tagged at the caller site (caller passes opts.category to broker.chat_stream). |
| Cost extraction | usage.cost (OpenRouter convention; dollars as a number) plus usage.cost_details.upstream_inference_cost (more detailed). For Anthropic/Bedrock the cost arrives in dollars on usage.cost. For pure local llama.cpp: no cost field, so record 0. | Single field name across all observed providers (per baseline B7; to be confirmed). |
| Cost precision | Store as number (Lua double: 53-bit mantissa, ~15 decimal digits; plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: [aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY. Crossed-flag stored on ctx; reset only on session end / :cost reset. | One-shot to avoid spamming. |
| :reset interaction | :reset does NOT clear ctx.usage_totals (parity with memory_items/project): the user reset their conversation, not their cost tracking. :cost reset is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional usage field when the broker returned one. history.lua log_turn writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |
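The "Cost precision" row can be sanity-checked with a small experiment. The sketch below is not part of aish; it simply adds a made-up per-call cost of $0.0001 one hundred thousand times (far more calls than a session would see) and measures how far the running total drifts from the exact value of $10.

```lua
-- Accumulate a hypothetical per-call cost many times and measure the drift
-- of the double-precision running total from the exact sum.
local per_call = 0.0001     -- illustrative cost in dollars, not a real price
local calls    = 100000

local total = 0
for _ = 1, calls do
    total = total + per_call
end

local drift = math.abs(total - 10)
print(("total=%.10f drift=%.3e"):format(total, drift))
```

The drift stays many orders of magnitude below the fourth decimal place that :cost prints, which is consistent with storing cost as a plain Lua number.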

3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| broker.lua | chat_stream(cfg, msgs, on_delta, opts) with text + tool_call kinds; chat returns text | Extract usage from final SSE chunk; emit on_delta("usage", payload); chat returns (text, usage). New opts.include_usage (default true); new opts.category (passed through as a tag in the usage payload). |
| context.lua | system prompt + turns + memory + project + summary | Add self.usage_totals (table) + self.cost_warn_fired (bool). New helpers: Context:add_usage(model, category, usage), Context:total_cost(), Context:total_tokens(). Context:reset does NOT clear usage_totals (parity with memory_items / project per R8). |
| repl.lua | ask_ai + delegate + summarize callbacks + Norris helpers | Wire opts.category at each broker call site (main / delegate / summarize / memory_summarize). Wire on_delta("usage", ...) -> ctx:add_usage(...). New :cost, :cost detail, and :cost reset metas. Cost-warn check after each add_usage call. |
| safety.lua | norris_step + is_destructive | Pass opts.category = "norris" (for the main chat_stream call) and "probe" (for the is_destructive LLM probe). Surfaces probe cost in the breakdown; useful since safety.llm_model = "cloud" is the recommended setting. |
| history.lua | session.log_turn appends JSONL entries | log_turn already takes the turn opaquely; assistant turns will carry usage if present, and it will serialize via dkjson. No code change unless a filter is desired. |
| config.lua | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out cost = { warn_at_dollars, warn_at_tokens } block. |
| docs/PHASE0.md | §11 lists phases 0-6 | Amendment: add Phase 7 row to §11. |

No new module files.


4. Pillar 1 — Usage capture in broker

SSE shape (provider-by-provider — confirm in baseline)

For OpenAI-compatible streams with stream_options: { include_usage: true }:

data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]

The final usage event arrives AFTER finish_reason but BEFORE [DONE]. choices is empty [] on the usage event.

For non-streaming chat: usage is in the response body at the top level. broker.chat is a wrapper around chat_stream, so it inherits the on_delta path.

For local llama.cpp via hossenfelder: usage may or may not be present depending on the proxy's version. Treat absence as zero-cost / unknown.
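To make the event shapes above concrete, here is a minimal, dependency-free sketch of spotting the usage event in a stream of SSE lines. The helper name usage_from_sse_line is hypothetical, and Lua pattern matching stands in for a real JSON decode (the broker would presumably decode each event with dkjson instead); it is meant only to illustrate which line carries usage and that absence yields nil.

```lua
-- Hypothetical helper: return a usage table if this SSE line carries one,
-- otherwise nil. Patterns are a stand-in for proper JSON decoding.
local function usage_from_sse_line(line)
    local payload = line:match("^data:%s*(.*)$")
    if not payload or payload == "[DONE]" then return nil end

    -- %b{} captures the balanced {...} object that follows "usage":
    local block = payload:match('"usage"%s*:%s*(%b{})')
    if not block then return nil end

    local function num(key)
        return tonumber(block:match('"' .. key .. '"%s*:%s*([%d%.eE+-]+)'))
    end

    return {
        prompt_tokens     = num("prompt_tokens") or 0,
        completion_tokens = num("completion_tokens") or 0,
        total_tokens      = num("total_tokens") or 0,
        cost              = num("cost"),  -- stays nil when the provider sends none
    }
end

-- Replay a stream shaped like the one above; only the third line yields usage.
local stream = {
    'data: {"id":"a","choices":[{"index":0,"delta":{"content":"Hi"}}]}',
    'data: {"id":"a","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    'data: {"id":"a","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004}}',
    'data: [DONE]',
}
local seen
for _, line in ipairs(stream) do
    seen = usage_from_sse_line(line) or seen
end
```

A stream from a backend that omits usage leaves seen as nil, which matches the "treat absence as zero-cost / unknown" rule.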

Extraction algorithm

local final_usage = nil

local function on_event(data)
    ...
    if doc.usage then
        -- Provider sent usage; capture for emission after the stream.
        final_usage = {
            prompt_tokens     = doc.usage.prompt_tokens or 0,
            completion_tokens = doc.usage.completion_tokens or 0,
            total_tokens      = doc.usage.total_tokens or 0,
            cost              = doc.usage.cost,   -- nil for local
            model             = doc.model or model_cfg.model,
            category          = opts.category or "main",  -- echoed for the accumulator
        }
        -- Don't emit yet — the [DONE] event marks stream end; emit
        -- once we exit the curl.post_sse loop so the caller sees
        -- usage as the LAST event in the stream order.
    end
    -- ... existing text + tool_call handling ...
end

-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
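The backward-compatibility claim for broker.chat follows from how Lua handles multiple return values. The sketch below assumes a buffering wrapper roughly like the one described; fake_chat_stream is a hypothetical stand-in that replays canned deltas, and the real wrapper's signature may differ.

```lua
-- Hypothetical stand-in for broker.chat_stream: two text deltas, usage last.
local function fake_chat_stream(cfg, msgs, on_delta, opts)
    on_delta("text", "Hel")
    on_delta("text", "lo")
    on_delta("usage", { prompt_tokens = 15, completion_tokens = 3,
                        total_tokens = 18, cost = 0.00004 })
end

-- Buffering wrapper in the spirit of broker.chat: returns (text, usage).
local function chat(stream_fn, cfg, msgs, opts)
    local parts, usage = {}, nil
    stream_fn(cfg, msgs, function(kind, payload)
        if kind == "text" then
            parts[#parts + 1] = payload
        elseif kind == "usage" then
            usage = payload
        end
        -- Other kinds (for example "tool_call") are ignored in this sketch.
    end, opts)
    return table.concat(parts), usage
end

-- A pre-Phase-7 caller binds one name; the extra return value is dropped.
local text_only = chat(fake_chat_stream, {}, {}, {})

-- A Phase 7 caller binds both.
local text, usage = chat(fake_chat_stream, {}, {}, {})
```

Because Lua discards surplus return values silently, existing single-assignment callers need no change. One caveat: a caller that passes the call result straight into another function's argument list as the last argument would forward both values.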

Outbound include_usage

local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
    body_table.stream_options = { include_usage = true }
end

Risk: some providers reject unrecognized fields. Baseline check; if any host throws on stream_options, the per-model opt-out is one line.
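One way the opt-out could look, sketched under assumptions: opts.include_usage is the per-call switch from the decisions table, while model_cfg.include_usage is a hypothetical per-model field that is not defined anywhere in this manifest. Only an explicit false disables the request field, so nil keeps the default-on behavior.

```lua
-- Decide whether to ask the provider for usage. Only an explicit `false`
-- (per call, or on the hypothetical per-model field) turns it off.
local function wants_usage(model_cfg, opts)
    if opts and opts.include_usage == false then return false end
    if model_cfg and model_cfg.include_usage == false then return false end
    return true
end

-- Build the outbound request body for a streaming chat call.
local function build_body(model_cfg, messages, opts)
    local body = { model = model_cfg.model, messages = messages, stream = true }
    if wants_usage(model_cfg, opts) then
        body.stream_options = { include_usage = true }
    end
    return body
end

local default_body = build_body({ model = "cloud-model" }, {}, {})
local opted_out    = build_body({ model = "picky-model", include_usage = false }, {}, {})
```

Here default_body carries stream_options and opted_out does not; the model names are placeholders.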

Category tagging

opts.category is a string set by the caller. broker echoes it into the emitted usage payload so the accumulator knows what to credit. Default category if absent: "main".
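A small sketch of the echo step, assuming the broker copies the payload rather than mutating it; tag_usage is a hypothetical helper name. It applies the "main" default and performs no validation, since whether categories form a closed set is still open (Q-C3).

```lua
-- Return a copy of the usage payload with the caller's category attached.
-- Falls back to "main" when the caller passed no category.
local function tag_usage(usage, opts)
    local tagged = {}
    for k, v in pairs(usage) do
        tagged[k] = v
    end
    tagged.category = (opts and opts.category) or "main"
    return tagged
end

local raw     = { prompt_tokens = 150, completion_tokens = 30 }
local probe   = tag_usage(raw, { category = "probe" })
local default = tag_usage(raw, {})
```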


5. Pillar 2 — Accumulator on ctx

Shape

ctx.usage_totals = {
    -- [model_name] = { [category] = { prompt = N, completion = N,
    --                                 calls = N, cost = N } }
    fast = {
        main      = { prompt = 1234, completion = 567, calls = 14, cost = 0   },
    },
    cloud = {
        main      = { prompt = 3850, completion = 980, calls = 8,  cost = 0.0180 },
        delegate  = { prompt = 250,  completion = 80,  calls = 1,  cost = 0.0012 },
        probe     = { prompt = 150,  completion = 30,  calls = 1,  cost = 0.0042 },
    },
}
ctx.cost_warn_fired = false

add_usage

function Context:add_usage(model, category, u)
    model    = model    or "?"
    category = category or "main"
    self.usage_totals = self.usage_totals or {}
    local m = self.usage_totals[model] or {}
    local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
    c.prompt     = c.prompt     + (u.prompt_tokens or 0)
    c.completion = c.completion + (u.completion_tokens or 0)
    c.calls      = c.calls      + 1
    c.cost       = c.cost       + (u.cost or 0)
    m[category] = c
    self.usage_totals[model] = m
end

function Context:total_cost()
    local total = 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do total = total + c.cost end
    end
    return total
end

function Context:total_tokens()
    local p, comp = 0, 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do
            p    = p    + c.prompt
            comp = comp + c.completion
        end
    end
    return p, comp
end
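For illustration, the helpers above can be exercised on a throwaway context. The constructor new_context and the token and cost figures are made up for this sketch; the three methods repeat the definitions above so the snippet runs on its own.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor; the real Context carries many more fields.
local function new_context()
    return setmetatable({ usage_totals = {}, cost_warn_fired = false }, Context)
end

function Context:add_usage(model, category, u)
    model    = model    or "?"
    category = category or "main"
    local m = self.usage_totals[model] or {}
    local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
    c.prompt     = c.prompt     + (u.prompt_tokens or 0)
    c.completion = c.completion + (u.completion_tokens or 0)
    c.calls      = c.calls      + 1
    c.cost       = c.cost       + (u.cost or 0)   -- missing cost counts as 0
    m[category] = c
    self.usage_totals[model] = m
end

function Context:total_cost()
    local total = 0
    for _, m in pairs(self.usage_totals) do
        for _, c in pairs(m) do total = total + c.cost end
    end
    return total
end

function Context:total_tokens()
    local p, comp = 0, 0
    for _, m in pairs(self.usage_totals) do
        for _, c in pairs(m) do
            p, comp = p + c.prompt, comp + c.completion
        end
    end
    return p, comp
end

local ctx = new_context()
ctx:add_usage("cloud", "main",  { prompt_tokens = 100, completion_tokens = 20, cost = 0.002 })
ctx:add_usage("cloud", "main",  { prompt_tokens = 50,  completion_tokens = 10, cost = 0.001 })
ctx:add_usage("fast",  "probe", { prompt_tokens = 30,  completion_tokens = 5 })  -- local: no cost
local prompt, completion = ctx:total_tokens()
local cost = ctx:total_cost()
```

The two cloud calls land in a single cloud/main bucket with calls = 2, and the cost-less local call contributes tokens but nothing to the dollar total.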

Reset semantics

Context:reset() deliberately does NOT clear usage_totals — matches R8 invariant from Phase 6 (:reset clears turns, pending_exec_output, summary; preserves memory_items, project, and now usage_totals). The user reset their conversation, not their cost meter. :cost reset is the explicit reset verb for the meter.
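A minimal sketch of the two reset paths, assuming the field names listed above. Context:reset_usage is a hypothetical name for whatever :cost reset ends up calling, and leaving cost_warn_fired untouched in reset() is only one possible answer to Q-C6.

```lua
local Context = {}
Context.__index = Context

-- :reset clears conversation state only.
function Context:reset()
    self.turns = {}
    self.pending_exec_output = nil
    self.summary = nil
    -- memory_items, project, usage_totals and cost_warn_fired are
    -- deliberately left alone (R8 parity; the flag is subject to Q-C6).
end

-- Hypothetical helper behind ":cost reset": zero the meter and re-arm the warn.
function Context:reset_usage()
    self.usage_totals = {}
    self.cost_warn_fired = false
end

local ctx = setmetatable({
    turns = { "hi", "hello" },
    summary = "earlier chat",
    memory_items = { "prefers tabs" },
    usage_totals = { cloud = { main = { prompt = 10, completion = 2, calls = 1, cost = 0.001 } } },
    cost_warn_fired = true,
}, Context)

ctx:reset()
local calls_after_reset = ctx.usage_totals.cloud.main.calls  -- meter survives :reset
ctx:reset_usage()
```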


6. Pillar 3 — :cost meta

:cost                       summary line
:cost detail                per-model + per-category breakdown
:cost reset                 zero out ctx.usage_totals + cost_warn_fired

Summary format:

[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                       cost=$0.0234 (cloud only; local: 0)

Detail format (rows sorted by cost descending, ties broken by model name):

[aish] session usage detail:
  cloud     main      8 calls,  3,850 / 980 tokens,   $0.0180
  cloud     probe     1 call,     150 / 30  tokens,   $0.0042
  cloud     delegate  1 call,     250 / 80  tokens,   $0.0012
  fast      main     14 calls,  8,200 / 2,100 tokens, $0     (local)

Implementation: pure Lua iteration over ctx.usage_totals; no broker calls. Sorting uses table.sort on a flattened list.
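The flatten-and-sort step might look like the following sketch. The helper names (commas, flatten, summary_line) are hypothetical, the summary is collapsed onto one line for brevity, and the sort breaks cost ties by model and then category so the output is deterministic.

```lua
-- Format a non-negative integer with thousands separators, e.g. 12450 -> "12,450".
local function commas(n)
    local s = tostring(math.floor(n))
    local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
    return (out:gsub("^,", ""))
end

-- Flatten usage_totals[model][category] into a list sorted by cost descending.
local function flatten(usage_totals)
    local rows = {}
    for model, cats in pairs(usage_totals) do
        for category, c in pairs(cats) do
            rows[#rows + 1] = { model = model, category = category,
                                prompt = c.prompt, completion = c.completion,
                                calls = c.calls, cost = c.cost }
        end
    end
    table.sort(rows, function(a, b)
        if a.cost ~= b.cost then return a.cost > b.cost end
        if a.model ~= b.model then return a.model < b.model end
        return a.category < b.category
    end)
    return rows
end

-- One-line summary across all models and categories.
local function summary_line(usage_totals)
    local calls, p, comp, cost = 0, 0, 0, 0
    for _, r in ipairs(flatten(usage_totals)) do
        calls, p    = calls + r.calls, p + r.prompt
        comp, cost  = comp + r.completion, cost + r.cost
    end
    return ("[aish] session usage: %d calls, prompt=%s / completion=%s tokens, cost=$%.4f")
        :format(calls, commas(p), commas(comp), cost)
end

local totals = {
    fast  = { main     = { prompt = 8200, completion = 2100, calls = 14, cost = 0 } },
    cloud = { main     = { prompt = 3850, completion = 980,  calls = 8,  cost = 0.0180 },
              delegate = { prompt = 250,  completion = 80,   calls = 1,  cost = 0.0012 },
              probe    = { prompt = 150,  completion = 30,   calls = 1,  cost = 0.0042 } },
}
local rows = flatten(totals)
local line = summary_line(totals)
```

Sorting a flattened copy keeps ctx.usage_totals itself untouched, so the reporter stays a pure read.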


7. Pillar 4 — Warning thresholds

Config:

cost = {
    warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
    warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}

After every ctx:add_usage, check:

if config.cost and not ctx.cost_warn_fired then
    local cost = ctx:total_cost()
    if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
        renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
                        :format(cost, config.cost.warn_at_dollars))
        ctx.cost_warn_fired = true
    end
    -- (similar for warn_at_tokens; share the flag or use two)
end

One-shot per session. :cost reset clears the flag.
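The check above leaves open whether the two thresholds share one flag. As a sketch of the two-flag variant (an assumption, not a decision made in this manifest), the function below keeps separate hypothetical flags dollars_fired and tokens_fired in a state table, so crossing the dollar threshold does not suppress a later token warning. The status callback stands in for renderer.status, and the token message wording is made up.

```lua
-- Emit each warning at most once. `state` holds the two one-shot flags;
-- `status` is any function that accepts the message string.
local function check_warn(cost_cfg, state, total_cost, total_tokens, status)
    if not cost_cfg then return end   -- no cfg.cost block: warnings are off

    if cost_cfg.warn_at_dollars and not state.dollars_fired
       and total_cost >= cost_cfg.warn_at_dollars then
        state.dollars_fired = true
        status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
               :format(total_cost, cost_cfg.warn_at_dollars))
    end

    if cost_cfg.warn_at_tokens and not state.tokens_fired
       and total_tokens >= cost_cfg.warn_at_tokens then
        state.tokens_fired = true
        status(("session tokens %d have crossed warn_at_tokens=%d")
               :format(total_tokens, cost_cfg.warn_at_tokens))
    end
end

-- Simulate four calls with growing totals; collect the emitted statuses.
local messages = {}
local function collect(msg) messages[#messages + 1] = msg end
local cfg, state = { warn_at_dollars = 0.50, warn_at_tokens = 100000 }, {}

check_warn(cfg, state, 0.30, 40000,  collect)  -- below both thresholds
check_warn(cfg, state, 0.60, 80000,  collect)  -- dollars cross
check_warn(cfg, state, 0.90, 120000, collect)  -- tokens cross
check_warn(cfg, state, 1.50, 200000, collect)  -- both already fired
```

With a single shared flag, the third call in this simulation would stay silent, which is the trade-off the inline comment in the manifest's snippet alludes to.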


8. UX Surface Summary

| Meta | Behavior |
|---|---|
| :cost | One-line summary: calls / tokens / cost |
| :cost detail | Per-model + per-category breakdown |
| :cost reset | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| cfg.cost.warn_at_dollars | nil | Status when cumulative cost first crosses this dollar amount |
| cfg.cost.warn_at_tokens | nil | Status when cumulative total tokens first cross this count |
| (broker opts.include_usage) | true | Adds stream_options.include_usage = true to outbound request |

9. Out of Scope (Phase 7)

  • Cross-session cost persistence — Q-C2 defers <history.dir>/cost.jsonl rollup; v1 is session-only. Per-turn usage IS in the session JSONL for after-the-fact aggregation if anyone wants to script it.
  • Per-model rate limiting / cost caps that REFUSE the call — v1 only warns. A future phase could add a hard cap that aborts before the broker call.
  • Pricing-table fallback for local models — if a local model doesn't emit usage.cost, we record 0. Estimating cost from token count + a static pricing table is a future polish (most users won't care about local "cost" anyway — local is free).
  • Pretty token-bandwidth charts / sparklines — out of scope; the detail breakdown is text-only.
  • Estimated cost for future turns — no preflight cost prediction.
  • MCP tool-call usage — MCP servers don't expose token usage; broker calls invoked DURING MCP tool dispatch ARE captured (because they go through the same path), but the MCP tool call itself isn't.

10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject stream_options -> SSE errors at the top of the stream | opts.include_usage = false opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter cost field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture usage.cost as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no cost -> displayed $0 could mislead user ("is this REALLY free?") | :cost detail annotates local lines with a literal (local); summary says cost=$X (cloud only; local: 0) |
| ctx.usage_totals grows unboundedly with new model names mid-session | Bounded by #models in config × #categories, both small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can :cost reset to re-arm. Future polish: warn at each Nx multiple. |

11. Open Questions (Phase 7)

| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-C1 | What to do when a provider doesn't emit usage (local llama.cpp, some misconfigured proxies)? Record zero / silently skip / estimate via char/4? | Accumulator entries for those models | Analyze (probe what hossenfelder actually returns per model class) |
| Q-C2 | Should <history.dir>/cost.jsonl accumulate cross-session totals? :cost --all-time reporter? | New file + persistence layer | Defer to follow-up phase (v1 = session only) |
| Q-C3 | Categories vs free-form tags: should opts.category be free-form (caller decides) or validated against a closed set? | Predictability of :cost detail output | Analyze |
| Q-C4 | Does the hossenfelder broker forward stream_options to all backends? Some backends may strip it. | include_usage default | Baseline (real probe) |
| Q-C5 | Should cfg.cost.warn_at_dollars trigger on the FIRST broker call that crosses, or on the NEXT one (so the user sees the call complete before the warn)? | UX detail | Analyze |
| Q-C6 | When :reset is called, should cost_warn_fired be cleared too (so the next warn-cross fires again)? Or only clear via :cost reset? | Reset surface area | Analyze |

12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

  • Phase 8: cross-session cost persistence (Q-C2 deferral), with optional cost dashboards / weekly rollup reporter.
  • Hard rate limits / cost caps that REFUSE the call — an extension of the warn surface that promotes warnings into preflight enforcement.
  • Better tokenization (Q1 deferred-from-Phase-3): replace the char/4 heuristic on Context:estimate_tokens() with model /tokenize calls. Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained — no upstream dependencies.