aish/docs/PHASE7.md
marfrit f0bccdec48 docs/PHASE7: analyze — probe broker surface + resolve Qs in-place
Status: Formulate -> Analyze (tree at 3bad07b probed).

11 findings (A1-A11), 5/6 open Qs resolved (Q-C4 deferred to baseline):

A1.  broker.chat_stream surface clean — usage capture via closure-local
     + on_delta("usage") emission after curl.post_sse returns.
A2.  7 caller sites for opts.category threading (probe / norris /
     summarize / main / delegate x2 / memory_summarize).
A3.  build_request signature widens to (model_cfg, msgs, stream, opts)
     to absorb tools / max_tokens / include_usage / stream_options
     without further positional growth.
A4.  Q-C3 RESOLVED: free-form categories (caller decides); matches
     Phase 6 helpers/skills convention.
A5.  Q-C5 RESOLVED: warn fires on the call that crossed (no NEXT-call
     delay).
A6.  Q-C6 RESOLVED: :reset does NOT clear cost_warn_fired; only
     :cost reset clears.
A7.  Norris call-graph rewires (commit 955bd82) — secrets streaming
     rehydrator wraps only "text" kind; new "usage" kind passes
     through unchanged. No new entanglement.
A8.  ctx.usage_totals survives :reset (R8 parity with memory_items,
     project).
A9.  Session JSONL inherits the new field automatically (dkjson
     opaque encoding).
A10. Q-C1 PARTIAL: defensive silent skip when provider omits usage.
     Real probe required for local model — baseline action.
A11. Q-C4 deferred to baseline (real broker probe).

§2 build_request row updated to mention the A3 refactor.
§11 Open Qs table now shows all 6 with resolutions; only Q-C4
remains as a baseline-time probe.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:49:03 +00:00


aish — Phase 7 Manifest

Project: aish (AI-augmented conversational shell)
Document: Phase 7 Requirements, Architecture & Design Decisions
Status: Analyze (formulate complete; tree at 3bad07b probed)
Date: 2026-05-16

Analyze findings (2026-05-16):

A1. broker.chat_stream surface is clean for the extension. The existing on_event(data) closure inside M.chat_stream already parses doc.error / doc.choices / delta / tool_calls, so adding if doc.usage then final_usage = ... end is one block. Emission happens via a closure-local final_usage: the post-loop code in chat_stream reads it and calls on_delta("usage", final_usage). On the outbound side there are two options: give build_request a minor extension, or have chat_stream insert stream_options.include_usage = true into the body table before it is encoded. The second would be cleaner, but the body is currently encoded inside build_request, so chat_stream never sees the table. Cleanest: extend build_request(model_cfg, messages, stream, opts) so it can read opts.include_usage. Phase 7 simplifies the signature in passing.

A2. 7 caller sites identified for opts.category threading:

| Site | Category |
|---|---|
| `safety.lua:191` (LLM probe) | `"probe"` |
| `safety.lua:354` (norris main) | `"norris"` |
| `repl.lua:326` (summarize-on-evict) | `"summarize"` |
| `repl.lua:685` (call_broker wrapper, used by ask_ai) | `"main"` |
| `repl.lua:1104` (DELEGATE: handler) | `"delegate"` |
| `repl.lua:1587` (:memory summarize) | `"memory_summarize"` |
| `repl.lua:2156` (:delegate meta) | `"delegate"` |

All callers pass `opts` already; adding a `category` field is
additive and backward-compatible (default to `"main"` when absent).

A3. build_request signature simplification. Today it takes (model_cfg, messages, stream, tools, max_tokens) — five positional args. With Phase 7 needing include_usage AND stream_options, positional growth gets unwieldy. Resolution: widen to (model_cfg, messages, stream, opts) where opts carries {tools, max_tokens, include_usage, stream_options}. Callers in M.chat_stream and M.chat pass their existing opts table through. This is a refactor but contained inside broker.lua.
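A minimal sketch of the widened signature, assuming the body is built as a plain table before encoding. The helper name build_body is hypothetical, JSON encoding is left out, and gating stream_options on stream being true is an assumption of this sketch, not something the probe established.

```lua
-- Sketch only: field names follow this manifest; the real broker.lua may differ.
local function build_body(model_cfg, messages, stream, opts)
    opts = opts or {}
    local body = {
        model    = model_cfg.model,
        messages = messages,
        stream   = stream and true or false,
    }
    -- Former positional args now ride in the opts table.
    if opts.tools      then body.tools      = opts.tools      end
    if opts.max_tokens then body.max_tokens = opts.max_tokens end
    -- An explicit stream_options table wins; otherwise usage reporting is
    -- requested on streaming calls unless the caller set include_usage = false.
    if opts.stream_options then
        body.stream_options = opts.stream_options
    elseif stream and opts.include_usage ~= false then
        body.stream_options = { include_usage = true }
    end
    return body
end

local on  = build_body({ model = "cloud" }, {}, true, { max_tokens = 256 })
local off = build_body({ model = "cloud" }, {}, true, { include_usage = false })
print(on.stream_options.include_usage, off.stream_options)
```

New options can then be added as opts fields without touching any caller's argument list.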

A4. Q-C3 RESOLVED: free-form categories. The closed-set vs free-form debate resolved in favor of free-form, per the helpers/skills convention already in place (Phase 6 :tree / :diff metas don't validate sub-args either). :cost detail will show whatever categories appear. In practice that is a small, documented set (the six distinct categories used by the seven call sites in A2), so no surprises.

A5. Q-C5 RESOLVED: warn fires on the call that crossed. The crossed call's usage IS in the accumulator at the moment we check (we check AFTER add_usage). Firing on the NEXT call would mean a delay of one full broker round-trip before the user sees the warn — defeats the purpose. Just emit-on-cross.

A6. Q-C6 RESOLVED: :reset does NOT clear cost_warn_fired. Parity with usage_totals itself (per the §2 decision row); the user reset their conversation, not their cost meter. The flag AND the totals are reset only by the explicit :cost reset verb.

A7. Norris call-graph rewires (existing safety.lua:354 path): with issue #52 wired (commit 955bd82), the Norris broker call now passes helpers.scrub_msgs / helpers.streaming_rehydrator. The on_delta wrapping pattern means I need to be careful that the new ("usage", payload) kind also flows through any wrapper. Since secrets streaming_rehydrator only matches on kind == "text", the "usage" kind passes through unchanged. No new entanglement.

A8. ctx.usage_totals survives :reset per R8 — same invariant as memory_items (Phase 4) and project (Phase 6). Documented in §5 of the manifest; reinforces the "ambient context survives conversation reset" rule.

A9. Session JSONL serialization — assistant turn dict gets an optional usage field. history.lua log_turn currently calls json.encode(turn) opaquely; the dkjson serializer handles nested tables. No code change needed; the new field flows through automatically when the assistant turn carries one.

A10. Q-C1 PARTIAL: local providers may not emit usage. The formulate-time assumption was "treat absence as zero-cost / unknown". A real probe against qwen-coder-7b-snappy-8k is a baseline action — see B-probes below. The implementation will be defensive: if doc.usage never appears in the stream, no "usage" event is emitted, and the accumulator is unchanged for that turn. :cost output naturally reflects "0 calls counted for local model" if that's the case.

A11. Q-C4 deferred to baseline: actual stream_options forwarding by the hossenfelder proxy must be probed against a live broker. If the proxy strips the option, we get no usage events even for cloud calls. Baseline action.

PHASE0 is the locked substrate; PHASE1-6 are layered on top. This manifest specifies what Phase 7 adds — cost / usage observability: the ability to know, mid-session, how many tokens you've spent and how much money the paid-cloud calls have cost.

PHASE0 §11 originally listed phases only through 6; this commit amends §11 to add Phase 7.


1. Scope of Phase 7

Four pillars:

  1. Usage capture in broker: broker.chat_stream extracts the provider's usage block (and cost where present) from the response stream and surfaces it to the caller via a new on_delta("usage", ...) kind. The existing broker.chat buffering wrapper exposes it as a second return value (text, usage). Backward-compatible: callers that don't handle the new kind / second value simply ignore it.

  2. Per-session accumulator on ctx: running totals per-model AND per-call-category (main / delegate / summarize / memory_summarize / probe / norris) accumulate on ctx.usage_totals. No persistence across sessions in v1 (Q-C2 defers cross-session); the session-log JSONL files DO carry per-turn usage so historical analysis is possible after the fact.

  3. :cost meta: a :cost reporter that shows the current session totals, with optional :cost detail for the per-model + per-category breakdown. Zero broker calls (purely local read of ctx.usage_totals).

  4. Optional warning thresholds: cfg.cost.warn_at_dollars and cfg.cost.warn_at_tokens emit a status the first time the running total crosses the configured threshold. Default off (no warnings without config). Useful when cloud presets are configured and you want a "you've spent $1 this session" nudge before runaway cost.

Phase 7 is done when:

  • broker.chat_stream exposes usage via the new on_delta("usage", ...) callback kind; broker.chat returns (text, usage). Backward compat preserved (no existing caller breaks).
  • After a session with mixed local + cloud calls, :cost prints a total like:
    [aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                                    cost=$0.0234 (cloud only; local: 0)
    
  • :cost detail breaks down by model + category:
    fast    main: 14 calls, 8200/2100 tokens
    cloud   main: 8 calls, 3850/980 tokens, $0.0180
    cloud   delegate: 1 call, 250/80 tokens, $0.0012
    cloud   probe: 1 call, 150/30 tokens, $0.0042
    
  • Session JSONL gains a usage field on assistant turns (when the broker returned one).
  • With cfg.cost.warn_at_dollars = 0.50 set, crossing $0.50 cumulative emits exactly one status line.
  • Existing configs without cfg.cost behave exactly like Phase 6 (Phase 6 regression coverage).

2. Technology Decisions (delta from Phase 6)

| Decision | Choice | Rationale |
|---|---|---|
| Where to extract usage | In the `broker.chat_stream` event loop, checking each SSE event for a `usage` field (expected on the final chunk) | The OpenAI streaming spec puts `usage` on the FINAL chunk when `stream_options: { include_usage: true }` is in the request body. The Anthropic-via-Bedrock path through OpenRouter is expected to respect this; to be verified at baseline. |
| New on_delta kind | `on_delta("usage", { prompt_tokens, completion_tokens, total_tokens, cost?, model?, native_finish_reason? })` | Mirrors the existing `("text", chunk)` / `("tool_call", call)` shape. Callers ignore unknown kinds; backward-compatible. |
| Where to enable usage on the wire | `opts.include_usage = true` (default true) sets `stream_options.include_usage = true` in the outbound request body | Off-switch for hosts that reject `stream_options`. Defaults on; the baseline probe must confirm that the current broker tolerates it (Q-C4). (A3: `build_request` signature widens to take an `opts` table; positional growth was getting unwieldy.) |
| Accumulator location | `ctx.usage_totals[model_name][category]` table | `ctx` is per-conversation; matches the :reset-survives-or-not rules already in place. |
| Categories | `"main"` (ask_ai), `"delegate"`, `"summarize"`, `"memory_summarize"`, `"probe"`, `"norris"` | One tag per call site. Tagged at the caller site (caller passes `opts.category` to `broker.chat_stream`). |
| Cost extraction | `usage.cost` (OpenRouter convention; dollars as a number) plus `usage.cost_details.upstream_inference_cost` (more detailed). For Anthropic/Bedrock the cost arrives in dollars on `usage.cost`. For pure local llama.cpp there is no cost field: record 0. | Single field name across all observed providers (per baseline B7, to be confirmed). |
| Cost precision | Store as number (Lua double = 53-bit mantissa, ~15 decimal digits; plenty for sub-cent precision) | No floating-point cumulative-error concerns at this scale. |
| Warning trigger | First crossing of either threshold emits a single status: `[aish] session cost $X.XXXX has crossed warn_at_dollars=$Y.YYYY`. Crossed-flag stored on `ctx`; reset only on session end / `:cost reset`. | One-shot to avoid spamming. |
| :reset interaction | `:reset` does NOT clear `ctx.usage_totals` (parity with memory_items/project): the user reset their conversation, not their cost tracking. `:cost reset` is the explicit reset verb. | Matches R8 invariant from Phase 6. |
| Session-log persistence | Assistant turn entries gain an optional `usage` field when broker returned one. `history.lua` `log_turn` writes it through verbatim. | Per-turn granularity preserved for after-the-fact analysis. No new file. |

3. Module Changes

| File | State after Phase 6 | Phase 7 changes |
|---|---|---|
| `broker.lua` | `chat_stream(cfg, msgs, on_delta, opts)` with text + tool_call kinds; `chat` returns text | Extract usage from final SSE chunk; emit `on_delta("usage", payload)`; `chat` returns `(text, usage)`. New `opts.include_usage` (default true); new `opts.category` (passed through as a tag in the usage payload). |
| `context.lua` | system prompt + turns + memory + project + summary | Add `self.usage_totals` (table) + `self.cost_warn_fired` (bool). New helpers: `Context:add_usage(model, category, usage)`, `Context:total_cost()`, `Context:total_tokens()`. `Context:reset` does NOT clear `usage_totals` (parity with memory_items / project per R8). |
| `repl.lua` | ask_ai + delegate + summarize callbacks + Norris helpers | Wire `opts.category` at each broker call site (main / delegate / summarize / memory_summarize). Wire `on_delta("usage", ...)` -> `ctx:add_usage(...)`. New `:cost` and `:cost detail` / `:cost reset` metas. Cost-warn check after each `add_usage` call. |
| `safety.lua` | norris_step + is_destructive | Pass `opts.category = "norris"` (for the main chat_stream call) and `"probe"` (for the is_destructive LLM probe). Surfaces probe cost in the breakdown, which is useful since `safety.llm_model = "cloud"` is the recommended setting. |
| `history.lua` | `session.log_turn` appends JSONL entries | `log_turn` already takes the turn opaquely; assistant turns will carry `usage` if present and it will serialize via dkjson. No code change unless a filter is desired. |
| `config.lua` | example blocks for mcp/safety/memory/routing/secrets/hooks/project | Add commented-out `cost = { warn_at_dollars, warn_at_tokens }` block. |
| `docs/PHASE0.md` | §11 lists phases 0-6 | Amendment: add Phase 7 row to §11. |

No new module files.


4. Pillar 1 — Usage capture in broker

SSE shape (provider-by-provider — confirm in baseline)

For OpenAI-compatible streams with stream_options: { include_usage: true }:

data: {"id":"...","choices":[{"index":0,"delta":{"content":"Hi"}, ...}]}
data: {"id":"...","choices":[{"index":0,"delta":{}, "finish_reason":"stop"}]}
data: {"id":"...","choices":[],"usage":{"prompt_tokens":15,"completion_tokens":3,"total_tokens":18,"cost":0.00004,"cost_details":{...}}}
data: [DONE]

The final usage event arrives AFTER finish_reason but BEFORE [DONE]. choices is empty [] on the usage event.

For non-streaming chat: usage is in the response body at the top level. broker.chat is a wrapper around chat_stream, so it inherits the on_delta path.

For local llama.cpp via hossenfelder: usage may or may not be present depending on the proxy's version. Treat absence as zero-cost / unknown.

Extraction algorithm

local final_usage = nil

local function on_event(data)
    -- ... existing decode of `data` into `doc` ...
    if doc.usage then
        -- Provider sent usage; capture for emission after the stream.
        final_usage = {
            prompt_tokens     = doc.usage.prompt_tokens or 0,
            completion_tokens = doc.usage.completion_tokens or 0,
            total_tokens      = doc.usage.total_tokens or 0,
            cost              = doc.usage.cost,   -- nil for local
            model             = doc.model or model_cfg.model,
            -- Echo the caller's tag so the accumulator knows what to credit.
            category          = opts.category or "main",
        }
        -- Don't emit yet: the [DONE] event marks stream end; emit
        -- once we exit the curl.post_sse loop so the caller sees
        -- usage as the LAST event in the stream order.
    end
    -- ... existing text + tool_call handling ...
end

-- After curl.post_sse returns (stream complete):
if final_usage then on_delta("usage", final_usage) end
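To make the event ordering concrete, here is a small runnable harness, offered as a sketch under stated assumptions. The function run_stream is hypothetical: it stands in for the curl.post_sse loop and receives already-decoded events as Lua tables, so no JSON or HTTP library is needed.

```lua
-- Hypothetical harness: `docs` stands in for already-decoded SSE events.
local function run_stream(docs, model_cfg, opts, on_delta)
    opts = opts or {}
    local final_usage = nil
    for _, doc in ipairs(docs) do
        if doc.usage then
            final_usage = {
                prompt_tokens     = doc.usage.prompt_tokens or 0,
                completion_tokens = doc.usage.completion_tokens or 0,
                total_tokens      = doc.usage.total_tokens or 0,
                cost              = doc.usage.cost,
                model             = doc.model or model_cfg.model,
                category          = opts.category or "main",
            }
        end
        -- The usage event has an empty choices list, so no text is emitted for it.
        local choice = doc.choices and doc.choices[1]
        local text = choice and choice.delta and choice.delta.content
        if text then on_delta("text", text) end
    end
    -- Usage is emitted once, after the loop, so it is always the last event.
    if final_usage then on_delta("usage", final_usage) end
end

local seen = {}
run_stream({
    { choices = { { delta = { content = "Hi" } } } },
    { choices = { { delta = {}, finish_reason = "stop" } } },
    { choices = {}, usage = { prompt_tokens = 15, completion_tokens = 3,
                              total_tokens = 18, cost = 0.00004 } },
}, { model = "cloud" }, { category = "probe" },
function(kind, payload) seen[#seen + 1] = { kind = kind, payload = payload } end)

-- A provider that omits usage yields only text events; nothing is accumulated.
local quiet = {}
run_stream({ { choices = { { delta = { content = "ok" } } } } },
           { model = "fast" }, nil,
           function(kind) quiet[#quiet + 1] = kind end)

print(#seen, seen[#seen].kind, seen[#seen].payload.category, #quiet)
```

The second call illustrates the defensive path from A10: when no usage block ever appears, no "usage" event fires and the caller's accumulator is left unchanged.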

Outbound include_usage

local body_table = { model = ..., messages = ..., stream = true }
if opts.include_usage ~= false then
    body_table.stream_options = { include_usage = true }
end

Risk: some providers reject unrecognized fields. Baseline check; if any host throws on stream_options, the per-model opt-out is one line.

Category tagging

opts.category is a string set by the caller. broker echoes it into the emitted usage payload so the accumulator knows what to credit. Default category if absent: "main".
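On the caller side, the tag can be consumed by wrapping on_delta. The sketch below is illustrative only: with_usage is a hypothetical helper name, the ctx table is a stand-in for the real Context, and keying the accumulator by the preset name passed in by the caller (rather than by payload.model) is an assumption.

```lua
-- Hypothetical wrapper: routes "usage" events to the accumulator and
-- forwards every other kind to the caller's original on_delta.
local function with_usage(ctx, model_name, on_delta)
    return function(kind, payload)
        if kind == "usage" then
            ctx:add_usage(model_name, payload.category, payload)
        else
            on_delta(kind, payload)
        end
    end
end

-- Stand-in for the real Context, just enough to exercise the wrapper.
local ctx = { log = {} }
function ctx:add_usage(model, category, u)
    self.log[#self.log + 1] = {
        model    = model,
        category = category or "main",   -- default when the tag is absent
        tokens   = u.total_tokens,
    }
end

local shown = {}
local cb = with_usage(ctx, "cloud", function(kind, payload)
    shown[#shown + 1] = payload
end)

cb("text", "Hi")
cb("usage", { total_tokens = 18, category = "delegate" })
cb("usage", { total_tokens = 7 })   -- untagged payload falls back to "main"

print(#shown, ctx.log[1].category, ctx.log[2].category)
```

Because the wrapper swallows only the "usage" kind, existing text and tool_call handling is untouched.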


5. Pillar 2 — Accumulator on ctx

Shape

ctx.usage_totals = {
    -- [model_name] = { [category] = { prompt = N, completion = N,
    --                                 calls = N, cost = N } }
    fast = {
        main      = { prompt = 8200, completion = 2100, calls = 14, cost = 0   },
    },
    cloud = {
        main      = { prompt = 3850, completion = 980, calls = 8,  cost = 0.0180 },
        delegate  = { prompt = 250,  completion = 80,  calls = 1,  cost = 0.0012 },
        probe     = { prompt = 150,  completion = 30,  calls = 1,  cost = 0.0042 },
    },
}
ctx.cost_warn_fired = false

add_usage

function Context:add_usage(model, category, u)
    model    = model    or "?"
    category = category or "main"
    self.usage_totals = self.usage_totals or {}
    local m = self.usage_totals[model] or {}
    local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
    c.prompt     = c.prompt     + (u.prompt_tokens or 0)
    c.completion = c.completion + (u.completion_tokens or 0)
    c.calls      = c.calls      + 1
    c.cost       = c.cost       + (u.cost or 0)
    m[category] = c
    self.usage_totals[model] = m
end

function Context:total_cost()
    local total = 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do total = total + c.cost end
    end
    return total
end

function Context:total_tokens()
    local p, comp = 0, 0
    for _, m in pairs(self.usage_totals or {}) do
        for _, c in pairs(m) do
            p    = p    + c.prompt
            comp = comp + c.completion
        end
    end
    return p, comp
end
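A short worked example of the helpers above, as a self-contained sketch. The constructor new_context is hypothetical, and the token and cost figures are illustrative.

```lua
local Context = {}
Context.__index = Context

-- Hypothetical constructor; the real Context carries many more fields.
local function new_context()
    return setmetatable({ usage_totals = {}, cost_warn_fired = false }, Context)
end

function Context:add_usage(model, category, u)
    model    = model    or "?"
    category = category or "main"
    local m = self.usage_totals[model] or {}
    local c = m[category] or { prompt = 0, completion = 0, calls = 0, cost = 0 }
    c.prompt     = c.prompt     + (u.prompt_tokens or 0)
    c.completion = c.completion + (u.completion_tokens or 0)
    c.calls      = c.calls      + 1
    c.cost       = c.cost       + (u.cost or 0)   -- missing cost counts as 0
    m[category] = c
    self.usage_totals[model] = m
end

function Context:total_cost()
    local total = 0
    for _, m in pairs(self.usage_totals) do
        for _, c in pairs(m) do total = total + c.cost end
    end
    return total
end

function Context:total_tokens()
    local p, comp = 0, 0
    for _, m in pairs(self.usage_totals) do
        for _, c in pairs(m) do
            p, comp = p + c.prompt, comp + c.completion
        end
    end
    return p, comp
end

local ctx = new_context()
ctx:add_usage("cloud", "delegate", { prompt_tokens = 250, completion_tokens = 80, cost = 0.0012 })
ctx:add_usage("cloud", "probe",    { prompt_tokens = 150, completion_tokens = 30, cost = 0.0042 })
ctx:add_usage("fast",  "main",     { prompt_tokens = 100, completion_tokens = 20 })  -- local: no cost
ctx:add_usage("fast",  "main",     { prompt_tokens = 50,  completion_tokens = 10 })

local p, c = ctx:total_tokens()
print(("%d / %d tokens, $%.4f"):format(p, c, ctx:total_cost()))  -- 550 / 140 tokens, $0.0054
```

The two local calls land in the same bucket (calls = 2, cost = 0), while the cloud calls stay separated by category.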

Reset semantics

Context:reset() deliberately does NOT clear usage_totals — matches R8 invariant from Phase 6 (:reset clears turns, pending_exec_output, summary; preserves memory_items, project, and now usage_totals). The user reset their conversation, not their cost meter. :cost reset is the explicit reset verb for the meter.
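The split between the two reset verbs can be sketched as follows. This is an illustration rather than the actual context.lua: the field list comes from the paragraph above, and reset_cost is a hypothetical name for whatever :cost reset ends up calling.

```lua
local Context = {}
Context.__index = Context

local function new_context()
    return setmetatable({
        turns = {}, pending_exec_output = nil, summary = nil,  -- conversation state
        memory_items = {}, project = nil,                      -- ambient state (R8)
        usage_totals = {}, cost_warn_fired = false,            -- cost meter (Phase 7)
    }, Context)
end

-- :reset clears conversation state only; ambient state and the cost
-- meter are deliberately left untouched.
function Context:reset()
    self.turns = {}
    self.pending_exec_output = nil
    self.summary = nil
end

-- Hypothetical helper behind `:cost reset`: zeroes the meter and re-arms
-- the one-shot warning.
function Context:reset_cost()
    self.usage_totals = {}
    self.cost_warn_fired = false
end

local ctx = new_context()
ctx.turns[1] = { role = "user", content = "hi" }
ctx.usage_totals.cloud = { main = { prompt = 10, completion = 2, calls = 1, cost = 0.001 } }
ctx.cost_warn_fired = true

ctx:reset()
print(#ctx.turns, ctx.usage_totals.cloud ~= nil, ctx.cost_warn_fired)  -- 0  true  true

ctx:reset_cost()
print(next(ctx.usage_totals), ctx.cost_warn_fired)                     -- nil  false
```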


6. Pillar 3 — :cost meta

:cost                       summary line
:cost detail                per-model + per-category breakdown
:cost reset                 zero out ctx.usage_totals + cost_warn_fired

Summary format:

[aish] session usage: 24 calls, prompt=12,450 / completion=3,190 tokens
                       cost=$0.0234 (cloud only; local: 0)

Detail format (sorted by total cost desc, then by model):

[aish] session usage detail:
  cloud     main      8 calls,  3,850 / 980 tokens,   $0.0180
  cloud     probe     1 call,     150 / 30  tokens,   $0.0042
  cloud     delegate  1 call,     250 / 80  tokens,   $0.0012
  fast      main     14 calls,  8,200 / 2,100 tokens, $0     (local)

Implementation: pure Lua iteration over ctx.usage_totals; no broker calls. Sorting uses table.sort on a flattened list.
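One possible shape for the flatten-and-sort step, as a sketch. The helper names flatten, commas, and format_row are hypothetical; the tie-break on category is an assumption added so the order is deterministic, and the column widths are approximate, not the final layout.

```lua
-- Flatten ctx.usage_totals into rows, then sort by cost desc, model, category.
local function flatten(usage_totals)
    local rows = {}
    for model, cats in pairs(usage_totals or {}) do
        for category, c in pairs(cats) do
            rows[#rows + 1] = {
                model = model, category = category, calls = c.calls,
                prompt = c.prompt, completion = c.completion, cost = c.cost,
            }
        end
    end
    table.sort(rows, function(a, b)
        if a.cost  ~= b.cost  then return a.cost  > b.cost  end
        if a.model ~= b.model then return a.model < b.model end
        return a.category < b.category
    end)
    return rows
end

-- Insert thousands separators: 12450 -> "12,450".
local function commas(n)
    local s = tostring(math.floor(n))
    local out = s:reverse():gsub("(%d%d%d)", "%1,"):reverse()
    return (out:gsub("^,", ""))
end

local function format_row(r)
    local noun = (r.calls == 1) and "call" or "calls"
    return ("  %-9s %-9s %d %s, %s / %s tokens, $%.4f"):format(
        r.model, r.category, r.calls, noun,
        commas(r.prompt), commas(r.completion), r.cost)
end

local totals = {
    fast  = { main     = { prompt = 8200, completion = 2100, calls = 14, cost = 0      } },
    cloud = { main     = { prompt = 3850, completion = 980,  calls = 8,  cost = 0.0180 },
              delegate = { prompt = 250,  completion = 80,   calls = 1,  cost = 0.0012 },
              probe    = { prompt = 150,  completion = 30,   calls = 1,  cost = 0.0042 } },
}

local rows = flatten(totals)
for _, r in ipairs(rows) do print(format_row(r)) end
-- Row order: cloud/main, cloud/probe, cloud/delegate, fast/main
```

Sorting a flattened list keeps the comparator simple and avoids depending on the iteration order of pairs, which Lua does not guarantee.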


7. Pillar 4 — Warning thresholds

Config:

cost = {
    warn_at_dollars = 0.50,    -- emit once when cumulative cost crosses
    warn_at_tokens  = 100000,  -- emit once when cumulative tokens crosses
}

After every ctx:add_usage, check:

if config.cost and not ctx.cost_warn_fired then
    local cost = ctx:total_cost()
    if config.cost.warn_at_dollars and cost >= config.cost.warn_at_dollars then
        renderer.status(("session cost $%.4f has crossed warn_at_dollars=$%.4f")
                        :format(cost, config.cost.warn_at_dollars))
        ctx.cost_warn_fired = true
    end
    -- (similar for warn_at_tokens; share the flag or use two)
end

One-shot per session. :cost reset clears the flag.
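A fuller sketch of the check covering both thresholds. It assumes the shared-flag option (one warning per session regardless of which threshold trips first); check_cost_warn, the token message wording, and the stub ctx are hypothetical.

```lua
-- Returns true when a warning was emitted on this call.
local function check_cost_warn(ctx, cost_cfg, status)
    if not cost_cfg or ctx.cost_warn_fired then return false end
    local cost = ctx:total_cost()
    local p, c = ctx:total_tokens()
    local msg
    if cost_cfg.warn_at_dollars and cost >= cost_cfg.warn_at_dollars then
        msg = ("session cost $%.4f has crossed warn_at_dollars=$%.4f")
              :format(cost, cost_cfg.warn_at_dollars)
    elseif cost_cfg.warn_at_tokens and (p + c) >= cost_cfg.warn_at_tokens then
        -- Hypothetical wording; the manifest only specifies the dollar message.
        msg = ("session tokens %d have crossed warn_at_tokens=%d")
              :format(p + c, cost_cfg.warn_at_tokens)
    end
    if not msg then return false end
    status(msg)
    ctx.cost_warn_fired = true   -- shared flag: one-shot for both thresholds
    return true
end

-- Stub ctx exposing only what the check reads.
local ctx = { cost = 0, prompt = 0, completion = 0, cost_warn_fired = false }
function ctx:total_cost()   return self.cost end
function ctx:total_tokens() return self.prompt, self.completion end

local messages = {}
local function status(line) messages[#messages + 1] = line end
local cfg_cost = { warn_at_dollars = 0.50, warn_at_tokens = 100000 }

ctx.cost = 0.30; local first  = check_cost_warn(ctx, cfg_cost, status)  -- below threshold
ctx.cost = 0.55; local second = check_cost_warn(ctx, cfg_cost, status)  -- crosses: fires
ctx.cost = 0.90; local third  = check_cost_warn(ctx, cfg_cost, status)  -- already fired

print(first, second, third, #messages)  -- false  true  false  1
```

Calling the check right after add_usage gives the emit-on-cross behavior from A5: the call that crosses the threshold is already in the totals when the comparison runs.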


8. UX Surface Summary

| Meta | Behavior |
|---|---|
| `:cost` | One-line summary: calls / tokens / cost |
| `:cost detail` | Per-model + per-category breakdown |
| `:cost reset` | Zero out totals + clear warn-fired flag |

| Config | Default | Effect |
|---|---|---|
| `cfg.cost.warn_at_dollars` | nil | Status when cumulative cost first crosses this dollar amount |
| `cfg.cost.warn_at_tokens` | nil | Status when cumulative total tokens first crosses |
| (broker `opts.include_usage`) | true | Adds `stream_options.include_usage = true` to outbound request |

9. Out of Scope (Phase 7)

  • Cross-session cost persistence — Q-C2 defers <history.dir>/cost.jsonl rollup; v1 is session-only. Per-turn usage IS in the session JSONL for after-the-fact aggregation if anyone wants to script it.
  • Per-model rate limiting / cost caps that REFUSE the call — v1 only warns. A future phase could add a hard cap that aborts before the broker call.
  • Pricing-table fallback for local models — if a local model doesn't emit usage.cost, we record 0. Estimating cost from token count + a static pricing table is a future polish (most users won't care about local "cost" anyway — local is free).
  • Pretty token-bandwidth charts / sparklines — out of scope; the detail breakdown is text-only.
  • Estimated cost for future turns — no preflight cost prediction.
  • MCP tool-call usage — MCP servers don't expose token usage; broker calls invoked DURING MCP tool dispatch ARE captured (because they go through the same path), but the MCP tool call itself isn't.

10. Risks

| Risk | Mitigation |
|---|---|
| Some providers reject `stream_options` -> SSE errors at the top of the stream | `opts.include_usage = false` opt-out per call site; baseline-time probe of the actual hossenfelder broker behavior |
| OpenRouter cost field shape varies between providers (Bedrock vs. Baidu vs. Together vs. ...) | Capture `usage.cost` as-is (number); document that the same provider must be used for cross-call comparison |
| Local llama.cpp returns no cost -> displayed $0 could mislead user ("is this REALLY free?") | `:cost detail` annotates local lines with a literal `(local)`; summary says `cost=$X (cloud only; local: 0)` |
| `ctx.usage_totals` grows unboundedly with new model names mid-session | Bounded by #models in config × #categories, both small constants. No mitigation needed. |
| Warn threshold fires once and never again for a long-running session that crosses 2x / 10x the threshold | Acceptable for v1; user can `:cost reset` to re-arm. Future polish: warn at each Nx multiple. |

11. Open Questions (Phase 7)

| # | Question | Impact | Resolution target |
|---|---|---|---|
| Q-C1 | Provider-without-usage handling | | A10: defensive silent skip; baseline probe will confirm shape on local llama.cpp. |
| Q-C2 | Cross-session cost persistence (cost.jsonl) | | Deferred to follow-up phase 8; v1 is session-only. |
| Q-C3 | Categories closed-set vs free-form | | A4: free-form; caller decides. Matches Phase 6 helpers/skills convention. |
| Q-C4 | stream_options forwarding by hossenfelder | | Baseline: probe required against the live broker. |
| Q-C5 | Warn fires on the crossed call or the next | | A5: on the crossed call (no UX-defeating delay). |
| Q-C6 | :reset clears cost_warn_fired | | A6: no, only :cost reset clears the flag (R8 parity). |

12. Phase 7 → Phase 8+ Out-of-band

Candidate follow-ups (non-binding):

  • Phase 8: cross-session cost persistence (Q-C2 deferral), with optional cost dashboards / weekly rollup reporter.
  • Hard rate limits / cost caps that REFUSE the call — an extension of the warn surface that promotes warnings into preflight enforcement.
  • Better tokenization (Q1 deferred-from-Phase-3): replace the char/4 heuristic on Context:estimate_tokens() with model /tokenize calls. Indirectly improves accuracy of any future "preflight cost predictor".

Phase 7 itself is self-contained — no upstream dependencies.