From 2244a3f1eea0c1323aa08940f21d7a98a8d6ad03 Mon Sep 17 00:00:00 2001 From: Markus Fritsche Date: Sat, 16 May 2026 22:49:53 +0000 Subject: [PATCH] docs/PHASE7-baseline: live broker probes for usage shape MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Real probes against hossenfelder.fritz.box:8082 against both backends. Five findings, all align with the formulate/analyze design — no structural changes. B1. `stream_options.include_usage = true` is safely accepted by both backends. REQUIRED for local llama.cpp to emit usage; no-op for cloud (which emits anyway). Default-true is correct. B2. Two emission patterns observed: - Cloud (Bedrock): usage rides the FINAL delta chunk with non-empty `choices` carrying finish_reason. - Local: usage rides a SEPARATE chunk with `choices: []` preceding `[DONE]`. Both shapes are handled by the same `if doc.usage then ...` check; the existing on_event choices-branch short-circuits safely when choices is empty. B3. `cost` field is dollar-denominated (number) and cloud-only. Local returns `timings` instead (perf, not cost). Accumulator captures `usage.cost` as-is; nil treated as 0. :cost detail annotates local lines so $0 isn't misread. B4. `doc.model` in the usage event reflects the upstream-API-version (e.g., Bedrock rewrites `anthropic/claude-haiku-4.5` to `anthropic/claude-4.5-haiku-20251001`). Accumulator keys by caller-intended `model_cfg.model`, NOT `doc.model`, for stable cross-call comparison. B5. Usage event is always the LAST data event before `[DONE]`. Emission of `on_delta("usage", ...)` happens after curl.post_sse returns — one call per stream, after all text + tool_calls. Q-C4 RESOLVED: hossenfelder forwards `stream_options.include_usage` to all backends correctly. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/PHASE7-baseline.md | 155 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 155 insertions(+) create mode 100644 docs/PHASE7-baseline.md diff --git a/docs/PHASE7-baseline.md b/docs/PHASE7-baseline.md new file mode 100644 index 0000000..d8ed817 --- /dev/null +++ b/docs/PHASE7-baseline.md @@ -0,0 +1,155 @@ +# Phase 7 Baseline — pre-implementation measurements + +**Date:** 2026-05-16 +**Tree probed:** `f0bccde` (PHASE7 formulate + analyze). +**Broker probed:** `hossenfelder.fritz.box:8082` (local `qwen-coder-7b-snappy-8k`, cloud `anthropic/claude-haiku-4.5`). + +This is the Phase 7 (verify) anchor for the cost/usage observability +work. Captures the world just before broker.lua / context.lua / repl.lua +edits land. + +--- + +## B1. `stream_options.include_usage = true` is safely accepted everywhere + +Probed both backends with and without the flag in the request body: + +| Backend | Without flag | With flag | Notes | +|---|---|---|---| +| Cloud (Anthropic via Bedrock through OpenRouter) | usage IS in final chunk | usage IS in final chunk | OpenRouter emits usage by default; the flag is a no-op there | +| Local llama.cpp (qwen-coder-7b-snappy-8k via hossenfelder) | NO usage emitted | usage IS in final chunk | The flag is **required** for local; hossenfelder forwards it correctly to llama.cpp | + +**Implication for §2 / §4:** the formulate-time decision to default +`opts.include_usage = true` is correct. Without the flag we'd silently +miss local-model usage tracking. With the flag both backends emit +`usage` reliably. No need for a per-backend opt-out in v1. + +--- + +## B2. Usage payload shape — TWO emission patterns + +**Cloud (Anthropic/Bedrock):** usage rides the FINAL delta chunk that +ALSO carries the closing `finish_reason`. `choices` is non-empty. + +```json +{ + "id": "gen-...", + "object": "chat.completion.chunk", + "model": "anthropic/claude-4.5-haiku-20251001", + "provider": "Amazon Bedrock", + "choices": [{ + "index": 0, + "delta": { "content": "", "role": "assistant" }, + "finish_reason": "length" + }], + "usage": { + "prompt_tokens": 8, + "completion_tokens": 4, + "total_tokens": 12, + "cost": 0.000028, // dollars + "cost_details": { + "upstream_inference_cost": 0.000028, + "upstream_inference_prompt_cost": 0.000008, + "upstream_inference_completions_cost": 0.00002 + }, + "prompt_tokens_details": { "cached_tokens": 0, "cache_write_tokens": 0, ... }, + "completion_tokens_details": { "reasoning_tokens": 0, ... } + } +} +``` + +**Local (llama.cpp):** usage rides a SEPARATE final chunk where +`choices: []`. Then `[DONE]` marker. + +```json +{ + "id": "chatcmpl-...", + "object": "chat.completion.chunk", + "model": "qwen-coder-7b-snappy-8k", + "choices": [], + "usage": { + "prompt_tokens": 30, + "completion_tokens": 6, + "total_tokens": 36, + "prompt_tokens_details": { "cached_tokens": 29 } + }, + "timings": { + "cache_n": 29, "prompt_n": 1, "prompt_ms": 152.391, + "predicted_n": 6, "predicted_ms": 758.778, ... + } +} +data: [DONE] +``` + +**Implication for §4 extraction algorithm:** `if doc.usage then +final_usage = doc.usage end` works for BOTH shapes (cloud-style +non-empty-choices chunk AND local-style empty-choices chunk). The +existing on_event branch on `choices and choices[1] and delta` is +short-circuited safely when choices is empty. + +--- + +## B3. `cost` field is dollar-denominated and present on cloud only + +| Provider | `usage.cost` | `usage.cost_details` | +|---|---|---| +| Anthropic via Bedrock (OpenRouter) | ✓ (number, USD) | ✓ (upstream_inference_cost / _prompt_cost / _completions_cost) | +| Local llama.cpp | absent | absent | + +The local model has `timings` instead — useful for perf observability +but NOT cost. **Implication:** in the accumulator, capture +`usage.cost` as-is when present; treat `nil` as 0 (matches the +formulate-time "local: free" framing). `:cost detail` annotates +local lines as `(local)` so the displayed `$0` isn't misread. + +--- + +## B4. Model identifier in usage events — choose source carefully + +Cloud's usage event carries: +- `doc.model = "anthropic/claude-4.5-haiku-20251001"` (the resolved upstream-API-version) + +But the REQUEST was `"model": "anthropic/claude-haiku-4.5"`. The +broker / OpenRouter rewrote the model name to the dated version. + +**Implication:** the accumulator should key by the CALLER-INTENDED model +name (i.e., `model_cfg.model` from the request, NOT `doc.model` from the +response). This keeps `:cost detail` output stable across upstream API +version bumps. Documented in §5 of the manifest already (uses +`model_name`). + +For local the two match (model_cfg.model == doc.model), so this is a +cloud-only consideration. + +--- + +## B5. Multi-chunk vs single-chunk delivery + +Cloud (Bedrock) returns the whole 4-token response in ~3 chunks (median +27 chars each per B2 of Phase 6 baseline). Local returns ~6 chunks of +~4 chars each. In both cases the `usage` event is the LAST data event +before `[DONE]`. So the post-`curl.post_sse` emission of +`on_delta("usage", ...)` in chat_stream is the right place to fire — +it happens once per stream, after all text/tool_calls have been +delivered. + +--- + +## Summary + +| Finding | Affects | Resolution | +|---|---|---| +| B1 stream_options safe + required for local | §4 `opts.include_usage` default | Default true; no per-backend opt-out needed | +| B2 two emission patterns (non-empty vs empty choices) | broker.on_event branch | `if doc.usage then final_usage = doc.usage end` works for both | +| B3 cost dollar-denominated, cloud-only | accumulator + :cost detail | Capture as-is; nil→0; annotate local lines | +| B4 model identifier rewrite by upstream | accumulator keying | Key by `model_cfg.model` (caller-intended) not `doc.model` | +| B5 usage is last event before [DONE] | emission placement | Fire `on_delta("usage", ...)` after curl.post_sse returns | + +All findings align with the formulate/analyze design. No structural +changes needed. The implementation can proceed to plan. + +**Q-C4 RESOLVED** (was: does the hossenfelder broker forward +`stream_options` to all backends?): YES — local llama.cpp receives +and honors the flag; cloud emits usage with or without (the flag is +a no-op there). Both confirmed via real probes against +`hossenfelder.fritz.box:8082`.