aish/docs/PHASE7-baseline.md

# Phase 7 Baseline — pre-implementation measurements

**Date:** 2026-05-16
**Tree probed:** `f0bccde` (PHASE7 formulate + analyze).
**Broker probed:** `hossenfelder.fritz.box:8082` (local `qwen-coder-7b-snappy-8k`, cloud `anthropic/claude-haiku-4.5`).

This is the Phase 7 (verify) anchor for the cost/usage observability
work. Captures the world just before broker.lua / context.lua / repl.lua
edits land.

---

## B1. `stream_options.include_usage = true` is safely accepted everywhere

Probed both backends with and without the flag in the request body:

| Backend | Without flag | With flag | Notes |
|---|---|---|---|
| Cloud (Anthropic via Bedrock through OpenRouter) | usage IS in final chunk | usage IS in final chunk | OpenRouter emits usage by default; the flag is a no-op there |
| Local llama.cpp (qwen-coder-7b-snappy-8k via hossenfelder) | NO usage emitted | usage IS in final chunk | The flag is **required** for local; hossenfelder forwards it correctly to llama.cpp |

**Implication for §2 / §4:** the formulate-time decision to default
`opts.include_usage = true` is correct. Without the flag we'd silently
miss local-model usage tracking. With the flag both backends emit
`usage` reliably. No need for a per-backend opt-out in v1.

---

## B2. Usage payload shape — TWO emission patterns

**Cloud (Anthropic/Bedrock):** usage rides the FINAL delta chunk that
ALSO carries the closing `finish_reason`. `choices` is non-empty.

```json
{
  "id": "gen-...",
  "object": "chat.completion.chunk",
  "model": "anthropic/claude-4.5-haiku-20251001",
  "provider": "Amazon Bedrock",
  "choices": [{
    "index": 0,
    "delta": { "content": "", "role": "assistant" },
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 4,
    "total_tokens": 12,
    "cost": 0.000028,                                  // dollars
    "cost_details": {
      "upstream_inference_cost": 0.000028,
      "upstream_inference_prompt_cost": 0.000008,
      "upstream_inference_completions_cost": 0.00002
    },
    "prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0, ... },
    "completion_tokens_details": { "reasoning_tokens": 0, ... }
  }
}
```

**Local (llama.cpp):** usage rides a SEPARATE final chunk where
`choices: []`. Then `[DONE]` marker.

```json
{
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "model": "qwen-coder-7b-snappy-8k",
  "choices": [],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 6,
    "total_tokens": 36,
    "prompt_tokens_details": { "cached_tokens": 29 }
  },
  "timings": {
    "cache_n": 29, "prompt_n": 1, "prompt_ms": 152.391,
    "predicted_n": 6, "predicted_ms": 758.778, ...
  }
}
data: [DONE]
```

**Implication for §4 extraction algorithm:** `if doc.usage then
final_usage = doc.usage end` works for BOTH shapes (cloud-style
non-empty-choices chunk AND local-style empty-choices chunk). The
existing on_event branch on `choices and choices[1] and delta` is
short-circuited safely when choices is empty.

---

## B3. `cost` field is dollar-denominated and present on cloud only

| Provider | `usage.cost` | `usage.cost_details` |
|---|---|---|
| Anthropic via Bedrock (OpenRouter) | ✓ (number, USD) | ✓ (upstream_inference_cost / _prompt_cost / _completions_cost) |
| Local llama.cpp | absent | absent |

The local model has `timings` instead — useful for perf observability
but NOT cost. **Implication:** in the accumulator, capture
`usage.cost` as-is when present; treat `nil` as 0 (matches the
formulate-time "local: free" framing). `:cost detail` annotates
local lines as `(local)` so the displayed `$0` isn't misread.

---

## B4. Model identifier in usage events — choose source carefully

Cloud's usage event carries:
- `doc.model = "anthropic/claude-4.5-haiku-20251001"` (the resolved upstream-API-version)

But the REQUEST was `"model": "anthropic/claude-haiku-4.5"`. The
broker / OpenRouter rewrote the model name to the dated version.

**Implication:** the accumulator should key by the CALLER-INTENDED model
name (i.e., `model_cfg.model` from the request, NOT `doc.model` from the
response). This keeps `:cost detail` output stable across upstream API
version bumps. Documented in §5 of the manifest already (uses
`model_name`).

For local the two match (model_cfg.model == doc.model), so this is a
cloud-only consideration.

---

## B5. Multi-chunk vs single-chunk delivery

Cloud (Bedrock) returns the whole 4-token response in ~3 chunks (median
27 chars each per B2 of Phase 6 baseline). Local returns ~6 chunks of
~4 chars each. In both cases the `usage` event is the LAST data event
before `[DONE]`. So the post-`curl.post_sse` emission of
`on_delta("usage", ...)` in chat_stream is the right place to fire —
it happens once per stream, after all text/tool_calls have been
delivered.

---

## Summary

| Finding | Affects | Resolution |
|---|---|---|
| B1 stream_options safe + required for local | §4 `opts.include_usage` default | Default true; no per-backend opt-out needed |
| B2 two emission patterns (non-empty vs empty choices) | broker.on_event branch | `if doc.usage then final_usage = doc.usage end` works for both |
| B3 cost dollar-denominated, cloud-only | accumulator + :cost detail | Capture as-is; nil→0; annotate local lines |
| B4 model identifier rewrite by upstream | accumulator keying | Key by `model_cfg.model` (caller-intended) not `doc.model` |
| B5 usage is last event before [DONE] | emission placement | Fire `on_delta("usage", ...)` after curl.post_sse returns |

All findings align with the formulate/analyze design. No structural
changes needed. The implementation can proceed to plan.

**Q-C4 RESOLVED** (was: does the hossenfelder broker forward
`stream_options` to all backends?): YES — local llama.cpp receives
and honors the flag; cloud emits usage with or without (the flag is
a no-op there). Both confirmed via real probes against
`hossenfelder.fritz.box:8082`.