docs/PHASE7-baseline: live broker probes for usage shape

Real probes against hossenfelder.fritz.box:8082 against both backends. Five findings, all align with the formulate/analyze design — no structural changes. B1. `stream_options.include_usage = true` is safely accepted by both backends. REQUIRED for local llama.cpp to emit usage; no-op for cloud (which emits anyway). Default-true is correct. B2. Two emission patterns observed: - Cloud (Bedrock): usage rides the FINAL delta chunk with non-empty `choices` carrying finish_reason. - Local: usage rides a SEPARATE chunk with `choices: []` preceding `[DONE]`. Both shapes are handled by the same `if doc.usage then ...` check; the existing on_event choices-branch short-circuits safely when choices is empty. B3. `cost` field is dollar-denominated (number) and cloud-only. Local returns `timings` instead (perf, not cost). Accumulator captures `usage.cost` as-is; nil treated as 0. :cost detail annotates local lines so $0 isn't misread. B4. `doc.model` in the usage event reflects the upstream-API-version (e.g., Bedrock rewrites `anthropic/claude-haiku-4.5` to `anthropic/claude-4.5-haiku-20251001`). Accumulator keys by caller-intended `model_cfg.model`, NOT `doc.model`, for stable cross-call comparison. B5. Usage event is always the LAST data event before `[DONE]`. Emission of `on_delta("usage", ...)` happens after curl.post_sse returns — one call per stream, after all text + tool_calls. Q-C4 RESOLVED: hossenfelder forwards `stream_options.include_usage` to all backends correctly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 22:49:53 +00:00
parent f0bccdec48
commit 2244a3f1ee
1 changed files with 155 additions and 0 deletions
@@ -0,0 +1,155 @@
 # Phase 7 Baseline — pre-implementation measurements
 **Date:** 2026-05-16
 **Tree probed:** `f0bccde` (PHASE7 formulate + analyze).
 **Broker probed:** `hossenfelder.fritz.box:8082` (local `qwen-coder-7b-snappy-8k`, cloud `anthropic/claude-haiku-4.5`).
 This is the Phase 7 (verify) anchor for the cost/usage observability
 work. Captures the world just before broker.lua / context.lua / repl.lua
 edits land.
 ---
 ## B1. `stream_options.include_usage = true` is safely accepted everywhere
 Probed both backends with and without the flag in the request body:
 | Backend | Without flag | With flag | Notes |
 |---|---|---|---|
 | Cloud (Anthropic via Bedrock through OpenRouter) | usage IS in final chunk | usage IS in final chunk | OpenRouter emits usage by default; the flag is a no-op there |
 | Local llama.cpp (qwen-coder-7b-snappy-8k via hossenfelder) | NO usage emitted | usage IS in final chunk | The flag is **required** for local; hossenfelder forwards it correctly to llama.cpp |
 **Implication for §2 / §4:** the formulate-time decision to default
 `opts.include_usage = true` is correct. Without the flag we'd silently
 miss local-model usage tracking. With the flag both backends emit
 `usage` reliably. No need for a per-backend opt-out in v1.
 ---
 ## B2. Usage payload shape — TWO emission patterns
 **Cloud (Anthropic/Bedrock):** usage rides the FINAL delta chunk that
 ALSO carries the closing `finish_reason`. `choices` is non-empty.
 ```json
 {
  "id": "gen-...",
  "object": "chat.completion.chunk",
  "model": "anthropic/claude-4.5-haiku-20251001",
  "provider": "Amazon Bedrock",
  "choices": [{
    "index": 0,
    "delta": { "content": "", "role": "assistant" },
    "finish_reason": "length"
  }],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 4,
    "total_tokens": 12,
    "cost": 0.000028,                                  // dollars
    "cost_details": {
      "upstream_inference_cost": 0.000028,
      "upstream_inference_prompt_cost": 0.000008,
      "upstream_inference_completions_cost": 0.00002
    },
    "prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0, ... },
    "completion_tokens_details": { "reasoning_tokens": 0, ... }
  }
 }
 ```
 **Local (llama.cpp):** usage rides a SEPARATE final chunk where
 `choices: []`. Then `[DONE]` marker.
 ```json
 {
  "id": "chatcmpl-...",
  "object": "chat.completion.chunk",
  "model": "qwen-coder-7b-snappy-8k",
  "choices": [],
  "usage": {
    "prompt_tokens": 30,
    "completion_tokens": 6,
    "total_tokens": 36,
    "prompt_tokens_details": { "cached_tokens": 29 }
  },
  "timings": {
    "cache_n": 29, "prompt_n": 1, "prompt_ms": 152.391,
    "predicted_n": 6, "predicted_ms": 758.778, ...
  }
 }
 data: [DONE]
 ```
 **Implication for §4 extraction algorithm:** `if doc.usage then
 final_usage = doc.usage end` works for BOTH shapes (cloud-style
 non-empty-choices chunk AND local-style empty-choices chunk). The
 existing on_event branch on `choices and choices[1] and delta` is
 short-circuited safely when choices is empty.
 ---
 ## B3. `cost` field is dollar-denominated and present on cloud only
 | Provider | `usage.cost` | `usage.cost_details` |
 |---|---|---|
 | Anthropic via Bedrock (OpenRouter) | ✓ (number, USD) | ✓ (upstream_inference_cost / _prompt_cost / _completions_cost) |
 | Local llama.cpp | absent | absent |
 The local model has `timings` instead — useful for perf observability
 but NOT cost. **Implication:** in the accumulator, capture
 `usage.cost` as-is when present; treat `nil` as 0 (matches the
 formulate-time "local: free" framing). `:cost detail` annotates
 local lines as `(local)` so the displayed `$0` isn't misread.
 ---
 ## B4. Model identifier in usage events — choose source carefully
 Cloud's usage event carries:
 - `doc.model = "anthropic/claude-4.5-haiku-20251001"` (the resolved upstream-API-version)
 But the REQUEST was `"model": "anthropic/claude-haiku-4.5"`. The
 broker / OpenRouter rewrote the model name to the dated version.
 **Implication:** the accumulator should key by the CALLER-INTENDED model
 name (i.e., `model_cfg.model` from the request, NOT `doc.model` from the
 response). This keeps `:cost detail` output stable across upstream API
 version bumps. Documented in §5 of the manifest already (uses
 `model_name`).
 For local the two match (model_cfg.model == doc.model), so this is a
 cloud-only consideration.
 ---
 ## B5. Multi-chunk vs single-chunk delivery
 Cloud (Bedrock) returns the whole 4-token response in ~3 chunks (median
 27 chars each per B2 of Phase 6 baseline). Local returns ~6 chunks of
 ~4 chars each. In both cases the `usage` event is the LAST data event
 before `[DONE]`. So the post-`curl.post_sse` emission of
 `on_delta("usage", ...)` in chat_stream is the right place to fire —
 it happens once per stream, after all text/tool_calls have been
 delivered.
 ---
 ## Summary
 | Finding | Affects | Resolution |
 |---|---|---|
 | B1 stream_options safe + required for local | §4 `opts.include_usage` default | Default true; no per-backend opt-out needed |
 | B2 two emission patterns (non-empty vs empty choices) | broker.on_event branch | `if doc.usage then final_usage = doc.usage end` works for both |
 | B3 cost dollar-denominated, cloud-only | accumulator + :cost detail | Capture as-is; nil→0; annotate local lines |
 | B4 model identifier rewrite by upstream | accumulator keying | Key by `model_cfg.model` (caller-intended) not `doc.model` |
 | B5 usage is last event before [DONE] | emission placement | Fire `on_delta("usage", ...)` after curl.post_sse returns |
 All findings align with the formulate/analyze design. No structural
 changes needed. The implementation can proceed to plan.
 **Q-C4 RESOLVED** (was: does the hossenfelder broker forward
 `stream_options` to all backends?): YES — local llama.cpp receives
 and honors the flag; cloud emits usage with or without (the flag is
 a no-op there). Both confirmed via real probes against
 `hossenfelder.fritz.box:8082`.